Computer Vision and Pattern Recognition 147
☆ FrugalNeRF: Fast Convergence for Few-shot Novel View Synthesis without Learned Priors
Neural Radiance Fields (NeRF) face significant challenges in few-shot
scenarios, primarily due to overfitting and long training times for
high-fidelity rendering. Existing methods, such as FreeNeRF and SparseNeRF, use
frequency regularization or pre-trained priors but struggle with complex
scheduling and bias. We introduce FrugalNeRF, a novel few-shot NeRF framework
that leverages weight-sharing voxels across multiple scales to efficiently
represent scene details. Our key contribution is a cross-scale geometric
adaptation scheme that selects pseudo ground truth depth based on reprojection
errors across scales. This guides training without relying on externally
learned priors, enabling full utilization of the training data. It can also
integrate pre-trained priors, enhancing quality without slowing convergence.
Experiments on LLFF, DTU, and RealEstate-10K show that FrugalNeRF outperforms
other few-shot NeRF methods while significantly reducing training time, making
it a practical solution for efficient and accurate 3D scene reconstruction.
comment: Project page: https://linjohnss.github.io/frugalnerf/
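A minimal sketch (not the authors' implementation) of the cross-scale
pseudo-depth selection described above, using random stand-in values for the
per-scale rendered depths and reprojection errors:

    import numpy as np

    rng = np.random.default_rng(0)
    num_rays, num_scales = 5, 3

    # depth_per_scale[s, r]: depth rendered for ray r by the voxel grid at scale s
    depth_per_scale = rng.uniform(1.0, 5.0, size=(num_scales, num_rays))
    # reproj_error[s, r]: photometric error when that depth is reprojected into
    # another training view (random stand-in values here)
    reproj_error = rng.uniform(0.0, 1.0, size=(num_scales, num_rays))

    # For each ray, keep the depth from the scale with the lowest reprojection
    # error and use it as pseudo ground truth to supervise all scales (L1 loss).
    best_scale = reproj_error.argmin(axis=0)
    pseudo_depth = depth_per_scale[best_scale, np.arange(num_rays)]
    depth_loss = np.abs(depth_per_scale - pseudo_depth).mean()
    print(pseudo_depth, depth_loss)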
☆ MvDrag3D: Drag-based Creative 3D Editing via Multi-view Generation-Reconstruction Priors
Drag-based editing has become popular in 2D content creation, driven by the
capabilities of image generative models. However, extending this technique to
3D remains a challenge. Existing 3D drag-based editing methods, whether
employing explicit spatial transformations or relying on implicit latent
optimization within limited-capacity 3D generative models, fall short in
handling significant topology changes or generating new textures across diverse
object categories. To overcome these limitations, we introduce MVDrag3D, a
novel framework for more flexible and creative drag-based 3D editing that
leverages multi-view generation and reconstruction priors. At the core of our
approach is the use of a multi-view diffusion model as a strong generative
prior to perform consistent drag editing over multiple rendered views, which is
followed by a reconstruction model that reconstructs 3D Gaussians of the edited
object. While the initial 3D Gaussians may suffer from misalignment between
different views, we address this via view-specific deformation networks that
adjust the position of Gaussians to be well aligned. In addition, we propose a
multi-view score function that distills generative priors from multiple views
to further enhance the view consistency and visual quality. Extensive
experiments demonstrate that MVDrag3D provides a precise, generative, and
flexible solution for 3D drag-based editing, supporting more versatile editing
effects across various object categories and 3D representations.
comment: 16 pages, 10 figures, conference
☆ SAM2Long: Enhancing SAM 2 for Long Video Segmentation with a Training-Free Memory Tree
Shuangrui Ding, Rui Qian, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Yuhang Cao, Yuwei Guo, Dahua Lin, Jiaqi Wang
The Segment Anything Model 2 (SAM 2) has emerged as a powerful foundation
model for object segmentation in both images and videos, paving the way for
various downstream video applications. The crucial design of SAM 2 for video
segmentation is its memory module, which prompts object-aware memories from
previous frames for current frame prediction. However, its greedy-selection
memory design suffers from the "error accumulation" problem, where an erroneous
or missed mask cascades and degrades the segmentation of subsequent frames,
limiting the performance of SAM 2 on complex long-term videos.
To this end, we introduce SAM2Long, an improved training-free video object
segmentation strategy, which considers the segmentation uncertainty within each
frame and chooses the video-level optimal results from multiple segmentation
pathways in a constrained tree search manner. In practice, we maintain a fixed
number of segmentation pathways throughout the video. For each frame, multiple
masks are proposed based on the existing pathways, creating various candidate
branches. We then select the same fixed number of branches with higher
cumulative scores as the new pathways for the next frame. After processing the
final frame, the pathway with the highest cumulative score is chosen as the
final segmentation result. Benefiting from its heuristic search design,
SAM2Long is robust to occlusions and object reappearances and can effectively
segment and track objects in complex long-term videos. Notably,
SAM2Long achieves an average improvement of 3.0 points across all 24
head-to-head comparisons, with gains of up to 5.3 points in J&F on long-term
video object segmentation benchmarks such as SA-V and LVOS. The code is
released at https://github.com/Mark12Ding/SAM2Long.
comment: Project page: https://mark12ding.github.io/project/SAM2Long/
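The constrained tree search over segmentation pathways can be sketched as
follows (illustrative only, not the released SAM2Long code; the mask proposals
and confidence scores are placeholders for SAM 2 outputs):

    import heapq

    NUM_PATHWAYS = 3

    def propose_masks(frame):
        # Stand-in for SAM 2 proposing several candidate masks, each with a
        # confidence score, for the current frame.
        return [(f"mask_f{frame}_c{c}", 1.0 / (1.0 + c + 0.1 * frame))
                for c in range(3)]

    def sam2long(num_frames):
        # Each pathway is a (cumulative_score, chosen_masks) pair.
        pathways = [(0.0, [])]
        for frame in range(num_frames):
            candidates = []
            for score, masks in pathways:
                for mask, conf in propose_masks(frame):
                    candidates.append((score + conf, masks + [mask]))
            # Keep a fixed number of highest-scoring branches (constrained search).
            pathways = heapq.nlargest(NUM_PATHWAYS, candidates, key=lambda p: p[0])
        # After the final frame, the pathway with the highest cumulative score wins.
        return max(pathways, key=lambda p: p[0])

    print(sam2long(num_frames=4))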
☆ xGen-MM-Vid (BLIP-3-Video): You Only Need 32 Tokens to Represent a Video Even in VLMs
Michael S. Ryoo, Honglu Zhou, Shrikant Kendre, Can Qin, Le Xue, Manli Shu, Silvio Savarese, Ran Xu, Caiming Xiong, Juan Carlos Niebles
We present xGen-MM-Vid (BLIP-3-Video): a multimodal language model for
videos, particularly designed to efficiently capture temporal information over
multiple frames. BLIP-3-Video takes advantage of the 'temporal encoder' in
addition to the conventional visual tokenizer, which maps a sequence of tokens
over multiple frames into a compact set of visual tokens. This enables
BLIP-3-Video to use far fewer visual tokens than competing models (e.g., 32
vs. 4608 tokens). We explore different types of temporal encoders, including
learnable spatio-temporal pooling as well as sequential models like Token
Turing Machines. We experimentally confirm that BLIP-3-Video obtains video
question-answering accuracies comparable to much larger state-of-the-art models
(e.g., 34B), while being much smaller (i.e., 4B) and more efficient by using
fewer visual tokens. The project website is at
https://www.salesforceairesearch.com/opensource/xGen-MM-Vid/index.html
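One way such a temporal encoder could look is a learnable attention-pooling
module that compresses all per-frame tokens into 32 video-level tokens; this is
an illustrative sketch with assumed dimensions, not the released implementation:

    import torch
    import torch.nn as nn

    class TemporalTokenPooler(nn.Module):
        def __init__(self, dim=768, num_video_tokens=32, num_heads=8):
            super().__init__()
            self.queries = nn.Parameter(torch.randn(1, num_video_tokens, dim) * 0.02)
            self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

        def forward(self, frame_tokens):
            # frame_tokens: (batch, frames * tokens_per_frame, dim)
            q = self.queries.expand(frame_tokens.size(0), -1, -1)
            pooled, _ = self.attn(q, frame_tokens, frame_tokens)
            return pooled  # (batch, 32, dim) regardless of video length

    pooler = TemporalTokenPooler()
    video = torch.randn(2, 8 * 576, 768)   # e.g. 8 frames x 576 tokens per frame
    print(pooler(video).shape)             # torch.Size([2, 32, 768])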
☆ 3DGS-Enhancer: Enhancing Unbounded 3D Gaussian Splatting with View-consistent 2D Diffusion Priors NeurIPS 2024
Novel-view synthesis aims to generate novel views of a scene from multiple
input images or videos, and recent advancements like 3D Gaussian splatting
(3DGS) have achieved notable success in producing photorealistic renderings
with efficient pipelines. However, generating high-quality novel views under
challenging settings, such as sparse input views, remains difficult due to
insufficient information in under-sampled areas, often resulting in noticeable
artifacts. This paper presents 3DGS-Enhancer, a novel pipeline for enhancing
the quality of 3DGS representations. We leverage 2D video
diffusion priors to address the challenging 3D view consistency problem,
reformulating it as achieving temporal consistency within a video generation
process. 3DGS-Enhancer restores view-consistent latent features of rendered
novel views and integrates them with the input views through a spatial-temporal
decoder. The enhanced views are then used to fine-tune the initial 3DGS model,
significantly improving its rendering performance. Extensive experiments on
large-scale datasets of unbounded scenes demonstrate that 3DGS-Enhancer yields
superior reconstruction performance and high-fidelity rendering results
compared to state-of-the-art methods. The project webpage is
https://xiliu8006.github.io/3DGS-Enhancer-project.
comment: Accepted by NeurIPS 2024 Spotlight
☆ Mini-InternVL: A Flexible-Transfer Pocket Multimodal Model with 5% Parameters and 90% Performance
Zhangwei Gao, Zhe Chen, Erfei Cui, Yiming Ren, Weiyun Wang, Jinguo Zhu, Hao Tian, Shenglong Ye, Junjun He, Xizhou Zhu, Lewei Lu, Tong Lu, Yu Qiao, Jifeng Dai, Wenhai Wang
Multimodal large language models (MLLMs) have demonstrated impressive
performance in vision-language tasks across a broad spectrum of domains.
However, the large model scale and associated high computational costs pose
significant challenges for training and deploying MLLMs on consumer-grade GPUs
or edge devices, thereby hindering their widespread application. In this work,
we introduce Mini-InternVL, a series of MLLMs with parameters ranging from 1B
to 4B, which achieves 90% of the performance with only 5% of the parameters.
This significant improvement in efficiency and effectiveness makes our models
more accessible and applicable in various real-world scenarios. To further
promote the adoption of our models, we develop a unified adaptation framework
for Mini-InternVL, which enables our models to transfer and outperform
specialized models in downstream tasks, including autonomous driving, medical
images, and remote sensing. We believe that our study can provide valuable
insights and resources to advance the development of efficient and effective
MLLMs. Code is available at https://github.com/OpenGVLab/InternVL.
comment: Technical report
☆ Agent-to-Sim: Learning Interactive Behavior Models from Casual Longitudinal Videos
We present Agent-to-Sim (ATS), a framework for learning interactive behavior
models of 3D agents from casual longitudinal video collections. Different from
prior works that rely on marker-based tracking and multiview cameras, ATS
learns natural behaviors of animal and human agents non-invasively through
video observations recorded over a long time-span (e.g., a month) in a single
environment. Modeling 3D behavior of an agent requires persistent 3D tracking
(e.g., knowing which point corresponds to which) over a long time period. To
obtain such data, we develop a coarse-to-fine registration method that tracks
the agent and the camera over time through a canonical 3D space, resulting in a
complete and persistent spacetime 4D representation. We then train a generative
model of agent behaviors using paired data of perception and motion of an agent
queried from the 4D reconstruction. ATS enables real-to-sim transfer from video
recordings of an agent to an interactive behavior simulator. We demonstrate
results on pets (e.g., cat, dog, bunny) and humans, given monocular RGBD videos
captured by a smartphone.
comment: Project page: https://gengshan-y.github.io/agent2sim-www/
☆ Elucidating the design space of language models for image generation
The success of autoregressive (AR) language models in text generation has
inspired the computer vision community to adopt Large Language Models (LLMs)
for image generation. However, considering the essential differences between
text and image modalities, the design space of language models for image
generation remains underexplored. We observe that image tokens exhibit greater
randomness compared to text tokens, which presents challenges when training
with token prediction. Nevertheless, AR models demonstrate their potential by
effectively learning patterns even from a seemingly suboptimal optimization
problem. Our analysis also reveals that while all models successfully grasp the
importance of local information in image generation, smaller models struggle to
capture the global context. In contrast, larger models showcase improved
capabilities in this area, helping to explain the performance gains achieved
when scaling up model size. We further elucidate the design space of language
models for vision generation, including tokenizer choice, model choice, model
scalability, vocabulary design, and sampling strategy through extensive
comparative experiments. Our work is the first to analyze the optimization
behavior of language models in vision generation, and we believe it can inspire
more effective designs when applying LMs to other domains. Finally, our
elucidated language model for image generation, termed ELM, achieves
state-of-the-art performance on the ImageNet 256x256 benchmark. The code is
available at https://github.com/Pepperlll/LMforImageGeneration.git.
comment: Project page: https://pepper-lll.github.io/LMforImageGeneration/
☆ Revisiting Deep Feature Reconstruction for Logical and Structural Industrial Anomaly Detection
Industrial anomaly detection is crucial for quality control and predictive
maintenance, but it presents challenges due to limited training data, diverse
anomaly types, and external factors that alter object appearances. Existing
methods commonly detect structural anomalies, such as dents and scratches, by
leveraging multi-scale features from image patches extracted through deep
pre-trained networks. However, significant memory and computational demands
often limit their practical application. Additionally, detecting logical
anomalies, such as images with missing or excess elements, requires an
understanding of spatial relationships that traditional patch-based methods
fail to capture. In this work, we address these limitations by focusing on Deep
Feature Reconstruction (DFR), a memory- and compute-efficient approach for
detecting structural anomalies. We further enhance DFR into a unified
framework, called ULSAD, which is capable of detecting both structural and
logical anomalies. Specifically, we refine the DFR training objective to
improve performance in structural anomaly detection, while introducing an
attention-based loss mechanism using a global autoencoder-like network to
handle logical anomaly detection. Our empirical evaluation across five
benchmark datasets demonstrates the performance of ULSAD in detecting and
localizing both structural and logical anomalies, outperforming eight
state-of-the-art methods. An extensive ablation study further highlights the
contribution of each component to the overall performance improvement. Our code
is available at https://github.com/sukanyapatra1997/ULSAD-2024.git
comment: Accepted in Transactions on Machine Learning Research (TMLR). Link to
OpenReview: https://openreview.net/forum?id=kdTC4ktHPD
☆ MoRE: Multi-Modal Contrastive Pre-training with Transformers on X-Rays, ECGs, and Diagnostic Report
In this paper, we introduce a novel Multi-Modal Contrastive Pre-training
Framework that synergistically combines X-rays, electrocardiograms (ECGs), and
radiology/cardiology reports. Our approach leverages transformers to encode
these diverse modalities into a unified representation space, aiming to enhance
diagnostic accuracy and facilitate comprehensive patient assessments. We
utilize LoRA-PEFT to significantly reduce trainable parameters in the LLM and
incorporate a recent linear attention dropping strategy in the Vision
Transformer (ViT) for smoother attention. Furthermore, we provide novel
multimodal attention explanations and retrieval for our model. To the best of
our knowledge, we are the first to propose an integrated model that combines
X-ray, ECG, and Radiology/Cardiology Report with this approach. By utilizing
contrastive loss, MoRE effectively aligns modality-specific features into a
coherent embedding, which supports various downstream tasks such as zero-shot
classification and multimodal retrieval. Employing our proposed methodology, we
achieve state-of-the-art (SOTA) performance on the MIMIC-IV, CheXpert, Edema
Severity, and PTB-XL downstream datasets, surpassing existing multimodal
approaches. Our framework shows significant improvements in capturing intricate
inter-modal relationships, and its robustness in medical diagnosis establishes
a foundation for future research in multimodal learning in the healthcare
sector.
comment: 10 pages, 5 figures, 9 tables. Supplementary detail in Appendix. Code
made available in Github for reproducibility
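The modality alignment can be illustrated with a generic symmetric contrastive
(InfoNCE-style) loss of the kind the abstract describes; this is a sketch with
random features, not the MoRE training code:

    import torch
    import torch.nn.functional as F

    def contrastive_loss(emb_a, emb_b, temperature=0.07):
        a = F.normalize(emb_a, dim=-1)
        b = F.normalize(emb_b, dim=-1)
        logits = a @ b.t() / temperature       # pairwise similarities
        targets = torch.arange(a.size(0))      # matched pairs lie on the diagonal
        return 0.5 * (F.cross_entropy(logits, targets) +
                      F.cross_entropy(logits.t(), targets))

    xray = torch.randn(4, 256)    # e.g. ViT features for 4 X-rays
    report = torch.randn(4, 256)  # e.g. LLM features for the paired reports
    print(contrastive_loss(xray, report))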
☆ Deep Radiomics Detection of Clinically Significant Prostate Cancer on Multicenter MRI: Initial Comparison to PI-RADS Assessment
G. A. Nketiah, M. R. Sunoqrot, E. Sandsmark, S. Langørgen, K. M. Selnæs, H. Bertilsson, M. Elschot, T. F. Bathen
Objective: To develop and evaluate a deep radiomics model for clinically
significant prostate cancer (csPCa, grade group >= 2) detection and compare its
performance to Prostate Imaging Reporting and Data System (PI-RADS) assessment
in a multicenter cohort. Materials and Methods: This retrospective study
analyzed biparametric (T2W and DW) prostate MRI sequences of 615 patients (mean
age, 63.1 +/- 7 years) from four datasets acquired between 2010 and 2020:
PROSTATEx challenge, Prostate158 challenge, PCaMAP trial, and an in-house
(NTNU/St. Olavs Hospital) dataset. With expert annotations as ground truth, a
deep radiomics model was trained, including nnU-Net segmentation of the
prostate gland, voxel-wise radiomic feature extraction, extreme gradient
boosting classification, and post-processing of tumor probability maps into csPCa
detection maps. Training involved 5-fold cross-validation using the PROSTATEx
(n=199), Prostate158 (n=138), and PCaMAP (n=78) datasets, and testing on the
in-house (n=200) dataset. Patient- and lesion-level performance were compared
to PI-RADS using area under ROC curve (AUROC [95% CI]), sensitivity, and
specificity analysis. Results: On the test data, the radiologist achieved a
patient-level AUROC of 0.94 [0.91-0.98] with 94% (75/80) sensitivity and 77%
(92/120) specificity at PI-RADS >= 3. The deep radiomics model at a tumor
probability cut-off >= 0.76 achieved 0.91 [0.86-0.95] AUROC with 90% (72/80)
sensitivity and 73% (87/120) specificity, not significantly different (p =
0.068) from PI-RADS. On the lesion level, PI-RADS cut-off >= 3 had 84% (91/108)
sensitivity at 0.2 (40/200) false positives per patient, while deep radiomics
attained 68% (73/108) sensitivity at the same false positive rate. Conclusion:
The deep radiomics model achieved performance comparable to PI-RADS assessment
for csPCa detection at the patient level but not at the lesion level.
comment: 20 pages, 4 figures, 4 tables
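A quick arithmetic check of the patient-level operating points quoted above,
computed directly from the reported counts:

    # (true positives, total positives, true negatives, total negatives)
    operating_points = {
        "PI-RADS >= 3": (75, 80, 92, 120),
        "Deep radiomics @ cut-off 0.76": (72, 80, 87, 120),
    }
    for name, (tp, pos, tn, neg) in operating_points.items():
        print(f"{name}: sensitivity {tp / pos:.1%}, specificity {tn / neg:.1%}")
    # PI-RADS >= 3: sensitivity 93.8%, specificity 76.7%
    # Deep radiomics @ cut-off 0.76: sensitivity 90.0%, specificity 72.5%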
☆ LLaVA-KD: A Framework of Distilling Multimodal Large Language Models
The success of Large Language Models (LLM) has led researchers to explore
Multimodal Large Language Models (MLLM) for unified visual and linguistic
understanding. However, the increasing model size and computational complexity
of MLLMs limit their use in resource-constrained environments. The small-scale
MLLM (s-MLLM) aims to retain the capabilities of the large-scale model (l-MLLM)
while reducing computational demands, but this typically results in a
significant decline in performance. To address the aforementioned issues, we
propose a novel LLaVA-KD
framework to transfer knowledge from l-MLLM to s-MLLM. Specifically, we
introduce Multimodal Distillation (MDist) to minimize the divergence between
the visual-textual output distributions of l-MLLM and s-MLLM, and Relation
Distillation (RDist) to transfer l-MLLM's ability to model correlations between
visual features. Additionally, we propose a three-stage training scheme to
fully exploit the potential of s-MLLM: 1) Distilled Pre-Training to align
visual-textual representations, 2) Supervised Fine-Tuning to equip the model
with multimodal understanding, and 3) Distilled Fine-Tuning to further transfer
l-MLLM capabilities. Our approach significantly improves performance without
altering the small model's architecture. Extensive experiments and ablation
studies validate the effectiveness of each proposed component. Code will be
available at https://github.com/caiyuxuan1120/LLaVA-KD.
comment: Under review
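The multimodal distillation term (MDist) amounts to minimizing the divergence
between the teacher's and student's output distributions; a minimal sketch with
an assumed temperature and illustrative tensor shapes, not the authors'
implementation:

    import torch
    import torch.nn.functional as F

    def mdist_loss(student_logits, teacher_logits, T=2.0):
        # KL(teacher || student) over the vocabulary, averaged over the batch.
        p_teacher = F.softmax(teacher_logits / T, dim=-1)
        log_p_student = F.log_softmax(student_logits / T, dim=-1)
        return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * (T * T)

    student = torch.randn(2, 16, 32000)   # (batch, sequence, vocabulary)
    teacher = torch.randn(2, 16, 32000)
    print(mdist_loss(student, teacher))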
☆ Managing Bandwidth: The Key to Cloud-Assisted Autonomous Driving
Alexander Krentsel, Peter Schafhalter, Joseph E. Gonzalez, Sylvia Ratnasamy, Scott Shenker, Ion Stoica
Prevailing wisdom asserts that one cannot rely on the cloud for critical
real-time control systems like self-driving cars. We argue that we can, and
must. Following the trends of increasing model sizes, improvements in hardware,
and evolving mobile networks, we identify an opportunity to offload parts of
time-sensitive and latency-critical compute to the cloud. Doing so requires
carefully allocating bandwidth to meet strict latency SLOs, while maximizing
benefit to the car.
comment: 6 pages
☆ Improve Vision Language Model Chain-of-thought Reasoning
Ruohong Zhang, Bowen Zhang, Yanghao Li, Haotian Zhang, Zhiqing Sun, Zhe Gan, Yinfei Yang, Ruoming Pang, Yiming Yang
Chain-of-thought (CoT) reasoning in vision language models (VLMs) is crucial
for improving interpretability and trustworthiness. However, current training
recipes lack robust CoT reasoning data, relying on datasets dominated by short
annotations with minimal rationales. In this work, we show that training VLMs
on short answers does not generalize well to reasoning tasks that require more
detailed responses. To address this, we propose a two-fold approach. First, we
distill rationales from the GPT-4o model to enrich the training data and fine-tune
VLMs, boosting their CoT performance. Second, we apply reinforcement learning
to further calibrate reasoning quality. Specifically, we construct positive
(correct) and negative (incorrect) pairs of model-generated reasoning chains,
by comparing their predictions with annotated short answers. Using this
pairwise data, we apply the Direct Preference Optimization algorithm to refine
the model's reasoning abilities. Our experiments demonstrate significant
improvements in CoT reasoning on benchmark datasets and better generalization
to direct answer prediction as well. This work emphasizes the importance of
incorporating detailed rationales in training and leveraging reinforcement
learning to strengthen the reasoning capabilities of VLMs.
comment: 10 pages + appendix
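The preference step uses the standard Direct Preference Optimization objective
on correct vs. incorrect reasoning chains; a minimal sketch in which the
log-probabilities stand in for sequence likelihoods from the policy and a
frozen reference model:

    import torch
    import torch.nn.functional as F

    def dpo_loss(policy_chosen_logp, policy_rejected_logp,
                 ref_chosen_logp, ref_rejected_logp, beta=0.1):
        chosen_ratio = policy_chosen_logp - ref_chosen_logp
        rejected_ratio = policy_rejected_logp - ref_rejected_logp
        # Push the policy to prefer the correct chain more than the reference does.
        return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()

    print(dpo_loss(torch.tensor([-12.0]), torch.tensor([-15.0]),
                   torch.tensor([-13.0]), torch.tensor([-14.0])))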
☆ Training Better Deep Learning Models Using Human Saliency
This work explores how human judgement about salient regions of an image can
be introduced into deep convolutional neural network (DCNN) training.
Traditionally, training of DCNNs is purely data-driven. This often results in
learning features of the data that are only coincidentally correlated with
class labels. Human saliency can guide network training using our proposed new
component of the loss function that ConveYs Brain Oversight to Raise
Generalization (CYBORG) and penalizes the model for using non-salient regions.
This mechanism produces DCNNs achieving higher accuracy and generalization
compared to using the same training data without human saliency. Experimental
results demonstrate that CYBORG applies across multiple network architectures
and problem domains (detection of synthetic faces, iris presentation attacks,
and anomalies in chest X-rays), while requiring significantly less data than
training without human saliency guidance. Visualizations show that
CYBORG-trained models' saliency is more consistent across independent training
runs than traditionally-trained models, and also in better agreement with
humans. To lower the cost of collecting human annotations, we also explore
using deep learning to provide automated annotations. CYBORG training of CNNs
addresses important issues such as reducing the appetite for large training
sets, increasing interpretability, and reducing fragility by generalizing
better to new types of data.
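One simple way to penalize reliance on non-salient regions, in the spirit of
the loss component described above (an illustrative sketch, not the published
CYBORG loss; the model saliency map is a stand-in for a class-activation-style
map):

    import torch
    import torch.nn.functional as F

    def saliency_guided_loss(logits, labels, model_saliency, human_mask, alpha=0.5):
        ce = F.cross_entropy(logits, labels)
        # Normalize the model's map, then penalize mass on non-salient pixels.
        m = model_saliency / (model_saliency.sum(dim=(1, 2), keepdim=True) + 1e-8)
        off_saliency = (m * (1.0 - human_mask)).sum(dim=(1, 2)).mean()
        return ce + alpha * off_saliency

    logits = torch.randn(4, 2)
    labels = torch.randint(0, 2, (4,))
    model_map = torch.rand(4, 7, 7)                      # model attention (stand-in)
    human_mask = (torch.rand(4, 7, 7) > 0.5).float()     # human-salient pixels
    print(saliency_guided_loss(logits, labels, model_map, human_mask))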
☆ A Framework for Evaluating Predictive Models Using Synthetic Image Covariates and Longitudinal Data
We present a novel framework for synthesizing patient data with complex
covariates (e.g., eye scans) paired with longitudinal observations (e.g.,
visual acuity over time), addressing privacy concerns in healthcare research.
Our approach introduces controlled association in latent spaces generating each
data modality, enabling the creation of complex covariate-longitudinal
observation pairs. This framework facilitates the development of predictive
models and provides openly available benchmarking datasets for healthcare
research. We demonstrate our framework using optical coherence tomography (OCT)
scans, though it is applicable across domains. Using 109,309 2D OCT scan
slices, we trained an image generative model combining a variational
autoencoder and a diffusion model. Longitudinal observations were simulated
using a nonlinear mixed effect (NLME) model from a low-dimensional space of
random effects. We generated 1.1M OCT scan slices paired with five sets of
longitudinal observations at controlled association levels (100%, 50%, 10%,
5.26%, and 2% of between-subject variability). To assess the framework, we
modeled synthetic longitudinal observations with another NLME model, computed
empirical Bayes estimates of random effects, and trained a ResNet to predict
these estimates from synthetic OCT scans. We then incorporated ResNet
predictions into the NLME model for patient-individualized predictions.
Prediction accuracy on withheld data declined as intended with reduced
association between images and longitudinal measurements. Notably, in all but
the 2% case, we achieved within 50% of the theoretical best possible prediction
on withheld data, demonstrating our ability to detect even weak signals. This
confirms the effectiveness of our framework in generating synthetic data with
controlled levels of association, providing a valuable tool for healthcare
research.
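A minimal sketch of how controlled association between two latent spaces can be
realized (an assumption about the mechanism, not the authors' code): a shared
random effect explains a chosen fraction of between-subject variability in the
image latent, and the remainder is independent noise:

    import numpy as np

    def paired_latents(n_subjects, dim, shared_fraction, rng):
        shared = rng.normal(size=(n_subjects, dim))        # drives the NLME model
        independent = rng.normal(size=(n_subjects, dim))
        image_latent = (np.sqrt(shared_fraction) * shared +
                        np.sqrt(1.0 - shared_fraction) * independent)
        return shared, image_latent

    rng = np.random.default_rng(0)
    for frac in (1.0, 0.5, 0.1, 0.0526, 0.02):
        re, img = paired_latents(5000, 1, frac, rng)
        corr = np.corrcoef(re[:, 0], img[:, 0])[0, 1]
        print(f"{frac:.4f} shared variance -> correlation {corr:.2f}")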
☆ Beyond Filtering: Adaptive Image-Text Quality Enhancement for MLLM Pretraining
Han Huang, Yuqi Huo, Zijia Zhao, Haoyu Lu, Shu Wu, Bingning Wang, Qiang Liu, Weipeng Chen, Liang Wang
Multimodal large language models (MLLMs) have made significant strides by
integrating visual and textual modalities. A critical factor in training MLLMs
is the quality of image-text pairs within multimodal pretraining datasets.
However, de facto filter-based data quality enhancement paradigms
often discard a substantial portion of high-quality image data due to
inadequate semantic alignment between images and texts, leading to
inefficiencies in data utilization and scalability. In this paper, we propose
the Adaptive Image-Text Quality Enhancer (AITQE), a model that dynamically
assesses and enhances the quality of image-text pairs. AITQE employs a text
rewriting mechanism for low-quality pairs and incorporates a negative sample
learning strategy to improve evaluative capabilities by integrating
deliberately selected low-quality samples during training. Unlike prior
approaches that significantly alter text distributions, our method minimally
adjusts text to preserve data volume while enhancing quality. Experimental
results demonstrate that AITQE surpasses existing methods on various benchmarks,
effectively leveraging raw data and scaling efficiently with increasing data
volumes. We hope our work will inspire future research. The code and model are
available at: https://github.com/hanhuang22/AITQE.
☆ Griffon-G: Bridging Vision-Language and Vision-Centric Tasks via Large Multimodal Models
Large Multimodal Models (LMMs) have achieved significant breakthroughs in
various vision-language and vision-centric tasks based on auto-regressive
modeling. However, these models typically focus on either vision-centric tasks,
such as visual grounding and region description, or vision-language tasks, like
image captioning and multi-scenario VQA. None of the LMMs have yet
comprehensively unified both types of tasks within a single model, as seen in
Large Language Models in the natural language processing field. Furthermore,
even with abundant multi-task instruction-following data, directly stacking
these data for universal capabilities extension remains challenging. To address
these issues, we introduce a novel multi-dimension curated and consolidated
multimodal dataset, named CCMD-8M, which overcomes the data barriers of
unifying vision-centric and vision-language tasks through multi-level data
curation and multi-task consolidation. More importantly, we present Griffon-G,
a general large multimodal model that addresses both vision-centric and
vision-language tasks within a single end-to-end paradigm. Griffon-G resolves
the training collapse issue encountered during the joint optimization of these
tasks, achieving better training efficiency. Evaluations across multimodal
benchmarks, general Visual Question Answering (VQA) tasks, scene text-centric
VQA tasks, document-related VQA tasks, Referring Expression Comprehension, and
object detection demonstrate that Griffon-G surpasses the advanced LMMs and
achieves expert-level performance in complicated vision-centric tasks.
comment: This work has been submitted to the IEEE for possible publication.
Codes and data will be later released at
https://github.com/jefferyZhan/Griffon
☆ Sparkle: Mastering Basic Spatial Capabilities in Vision Language Models Elicits Generalization to Composite Spatial Reasoning
Yihong Tang, Ao Qu, Zhaokai Wang, Dingyi Zhuang, Zhaofeng Wu, Wei Ma, Shenhao Wang, Yunhan Zheng, Zhan Zhao, Jinhua Zhao
Vision language models (VLMs) have demonstrated impressive performance across
a wide range of downstream tasks. However, their proficiency in spatial
reasoning remains limited, despite its crucial role in tasks involving
navigation and interaction with physical environments. Specifically, much of
the spatial reasoning in these tasks occurs in two-dimensional (2D)
environments, and our evaluation reveals that state-of-the-art VLMs frequently
generate implausible and incorrect responses to composite spatial reasoning
problems, including simple pathfinding tasks that humans can solve effortlessly
at a glance. To address this, we explore an effective approach to enhance 2D
spatial reasoning within VLMs by training the model on basic spatial
capabilities. We begin by disentangling the key components of 2D spatial
reasoning: direction comprehension, distance estimation, and localization. Our
central hypothesis is that mastering these basic spatial capabilities can
significantly enhance a model's performance on composite spatial tasks
requiring advanced spatial understanding and combinatorial problem-solving. To
investigate this hypothesis, we introduce Sparkle, a framework that fine-tunes
VLMs on these three basic spatial capabilities, using synthetic data generation
and targeted supervision to form an instruction dataset for each capability. Our
experiments demonstrate that VLMs fine-tuned with Sparkle achieve significant
performance gains, not only in the basic tasks themselves but also in
generalizing to composite and out-of-distribution spatial reasoning tasks
(e.g., improving from 13.5% to 40.0% on the shortest path problem). These
findings underscore the effectiveness of mastering basic spatial capabilities
in enhancing composite spatial problem-solving, offering insights for improving
VLMs' spatial reasoning capabilities.
☆ Metric as Transform: Exploring beyond Affine Transform for Interpretable Neural Network
Artificial neural networks of varying architectures are generally built around
affine transformations at their core. However, we find dot-product neurons,
whose influence is global, less interpretable than the local influence of
Euclidean distance (as used in Radial Basis Function Networks). In this work,
we explore the generalization of dot-product neurons to $l^p$-norms, metrics,
and beyond. We find that metrics as transforms perform similarly to affine
transforms when used in multilayer perceptrons or convolutional neural
networks. Moreover, we explore various properties of metrics, compare them with
affine transforms, and present multiple cases where metrics seem to provide
better interpretability. We develop an interpretable local-dictionary-based
neural network and use it to understand and reject adversarial examples.
comment: 22 pages, 20 figures, 3 tables
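The core idea of replacing the affine (dot-product) transform with a metric
transform can be sketched as a layer whose units output distances to learned
centers, as in RBF-style networks (illustrative only, not the authors' code):

    import torch
    import torch.nn as nn

    class MetricLayer(nn.Module):
        def __init__(self, in_features, out_features):
            super().__init__()
            self.centers = nn.Parameter(torch.randn(out_features, in_features))

        def forward(self, x):
            # Negative Euclidean distance from each input to each learned center,
            # so that "closer" corresponds to a larger activation.
            return -torch.cdist(x, self.centers, p=2)

    layer = MetricLayer(16, 8)
    print(layer(torch.randn(4, 16)).shape)   # torch.Size([4, 8])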
☆ Pangea: A Fully Open Multilingual Multimodal LLM for 39 Languages
Xiang Yue, Yueqi Song, Akari Asai, Seungone Kim, Jean de Dieu Nyandwi, Simran Khanuja, Anjali Kantharuban, Lintang Sutawika, Sathyanarayanan Ramamoorthy, Graham Neubig
Despite recent advances in multimodal large language models (MLLMs), their
development has predominantly focused on English- and western-centric datasets
and tasks, leaving most of the world's languages and diverse cultural contexts
underrepresented. This paper introduces Pangea, a multilingual multimodal LLM
trained on PangeaIns, a diverse 6M instruction dataset spanning 39 languages.
PangeaIns features: 1) high-quality English instructions, 2) carefully
machine-translated instructions, and 3) culturally relevant multimodal tasks to
ensure cross-cultural coverage. To rigorously assess models' capabilities, we
introduce PangeaBench, a holistic evaluation suite encompassing 14 datasets
covering 47 languages. Results show that Pangea significantly outperforms
existing open-source models in multilingual settings and diverse cultural
contexts. Ablation studies further reveal the importance of English data
proportions, language popularity, and the number of multimodal training samples
on overall performance. We fully open-source our data, code, and trained
checkpoints, to facilitate the development of inclusive and robust multilingual
MLLMs, promoting equity and accessibility across a broader linguistic and
cultural spectrum.
comment: 52 pages, 27 figures
☆ Warped Diffusion: Solving Video Inverse Problems with Image Diffusion Models NeurIPS 2024
Giannis Daras, Weili Nie, Karsten Kreis, Alex Dimakis, Morteza Mardani, Nikola Borislavov Kovachki, Arash Vahdat
Using image models naively for solving inverse video problems often suffers
from flickering, texture-sticking, and temporal inconsistency in generated
videos. To tackle these problems, in this paper, we view frames as continuous
functions in the 2D space, and videos as a sequence of continuous warping
transformations between different frames. This perspective allows us to train
function space diffusion models only on images and utilize them to solve
temporally correlated inverse problems. The function space diffusion models
need to be equivariant with respect to the underlying spatial transformations.
To ensure temporal consistency, we introduce a simple post-hoc test-time
guidance towards (self)-equivariant solutions. Our method allows us to deploy
state-of-the-art latent diffusion models such as Stable Diffusion XL to solve
video inverse problems. We demonstrate the effectiveness of our method for
video inpainting and $8\times$ video super-resolution, outperforming existing
techniques based on noise transformations. We provide generated video results:
https://giannisdaras.github.io/warped_diffusion.github.io/.
comment: Accepted in NeurIPS 2024
☆ Towards Combating Frequency Simplicity-biased Learning for Domain Generalization NeurIPS 2024
Xilin He, Jingyu Hu, Qinliang Lin, Cheng Luo, Weicheng Xie, Siyang Song, Muhammad Haris Khan, Linlin Shen
Domain generalization methods aim to learn transferable knowledge from source
domains that can generalize well to unseen target domains. Recent studies show
that neural networks frequently suffer from a simplicity-biased learning
behavior which leads to over-reliance on specific frequency sets, known as
frequency shortcuts, instead of semantic information, resulting in poor
generalization performance. Although previous data augmentation techniques
successfully enhance generalization performance, they tend to induce the
learning of more frequency shortcuts, thereby creating an illusion of
generalization improvement. In this paper, we aim to prevent such learning behavior of
applying frequency shortcuts from a data-driven perspective. Given the
theoretical justification of models' biased learning behavior on different
spatial frequency components, which is based on the dataset frequency
properties, we argue that the learning behavior on various frequency components
could be manipulated by changing the dataset statistical structure in the
Fourier domain. Intuitively, as frequency shortcuts are hidden in the dominant
and highly dependent frequencies of the dataset structure, dynamically perturbing
the over-relied-on frequency components could prevent the application of
frequency shortcuts. To this end, we propose two effective data augmentation
modules designed to collaboratively and adaptively adjust the frequency
characteristic of the dataset, aiming to dynamically influence the learning
behavior of the model and ultimately serving as a strategy to mitigate shortcut
learning. Code is available at AdvFrequency
(https://github.com/C0notSilly/AdvFrequency).
comment: Accepted by NeurIPS 2024
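The general flavor of a Fourier-domain perturbation of this kind can be
sketched as follows (an illustrative example, not the released AdvFrequency
modules): amplitudes are jittered while phases, which carry image structure,
are preserved:

    import numpy as np

    def perturb_amplitude(image, strength=0.2, rng=None):
        rng = rng or np.random.default_rng()
        spectrum = np.fft.fft2(image, axes=(0, 1))
        amplitude, phase = np.abs(spectrum), np.angle(spectrum)
        # Multiplicative jitter on the amplitude spectrum only.
        amplitude *= 1.0 + strength * rng.uniform(-1.0, 1.0, size=amplitude.shape)
        return np.real(np.fft.ifft2(amplitude * np.exp(1j * phase), axes=(0, 1)))

    img = np.random.rand(32, 32, 3)
    print(perturb_amplitude(img).shape)   # (32, 32, 3)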
☆ An Explainable Contrastive-based Dilated Convolutional Network with Transformer for Pediatric Pneumonia Detection
Pediatric pneumonia remains a significant global threat, posing a larger
mortality risk than any other communicable disease. According to UNICEF, it is
a leading cause of mortality in children under five and requires prompt
diagnosis. Early diagnosis using chest radiographs is the prevalent standard,
but limitations include low radiation levels in unprocessed images and data
imbalance issues. This necessitates the development of efficient,
computer-aided diagnosis techniques. To this end, we propose a novel
EXplainable Contrastive-based Dilated Convolutional Network with Transformer
(XCCNet) for pediatric pneumonia detection. XCCNet harnesses the spatial power
of dilated convolutions and the global insights from contrastive-based
transformers for effective feature refinement. A robust chest X-ray processing
module tackles low-intensity radiographs, while adversarial-based data
augmentation mitigates the skewed distribution of chest X-rays in the dataset.
Furthermore, we actively integrate an explainability approach through feature
visualization, directly aligning it with the attention region that pinpoints
the presence of pneumonia or normality in radiographs. The efficacy of XCCNet
is comprehensively assessed on four publicly available datasets. Extensive
performance evaluation demonstrates the superiority of XCCNet compared to
state-of-the-art methods.
☆ Multimodal Flare Forecasting with Deep Learning
Solar flare forecasting mainly relies on photospheric magnetograms and
associated physical features to predict forthcoming flares. However, it is
believed that flare initiation mechanisms often originate in the chromosphere
and the lower corona. In this study, we employ deep learning as a purely
data-driven approach to compare the predictive capabilities of chromospheric
and coronal UV and EUV emissions across different wavelengths with those of
photospheric line-of-sight magnetograms. Our findings indicate that individual
EUV wavelengths can provide discriminatory power comparable or superior to that
of line-of-sight magnetograms. Moreover, we identify simple multimodal neural
network architectures that consistently outperform single-input models, showing
complementarity between the flare precursors that can be extracted from the
distinct layers of the solar atmosphere. To mitigate potential biases from
known misattributions in Active Region flare catalogs, our models are trained
and evaluated using full-disk images and a comprehensive flare event catalog at
the full-disk level. We introduce a deep-learning architecture suited for
extracting temporal features from full-disk videos.
☆ Increasing Interpretability of Neural Networks By Approximating Human Visual Saliency
Understanding specifically where a model focuses on within an image is
critical for human interpretability of the decision-making process. Deep
learning-based solutions are prone to learning coincidental correlations in
training datasets, causing over-fitting and reducing the explainability. Recent
advances have shown that guiding models to human-defined regions of saliency
within individual images significantly increases performance and
interpretability. Human-guided models also exhibit greater generalization
capabilities, as coincidental dataset features are avoided. Results show that
models trained with saliency incorporation display an increase in
interpretability of up to 30% over models trained without saliency information.
The collection of this saliency information, however, can be costly, laborious
and in some cases infeasible. To address this limitation, we propose a
combination strategy of saliency incorporation and active learning to reduce
the human annotation data required by 80% while maintaining the
interpretability and performance increase from human saliency. Extensive
experimentation outlines the effectiveness of the proposed approach across five
public datasets and six active learning criteria.
☆ LMHaze: Intensity-aware Image Dehazing with a Large-scale Multi-intensity Real Haze Dataset
Image dehazing has drawn significant attention in recent years. Learning-based
methods usually require paired hazy and corresponding ground-truth (haze-free)
images for training. However, it is difficult to collect real-world image
pairs, which hinders the development of existing methods. Although several
works partially alleviate this issue by using synthetic or small-scale real
datasets, the haze intensity distribution bias and scene homogeneity in
existing datasets limit the generalization ability of these methods,
particularly when encountering images with previously unseen
haze intensities. In this work, we present LMHaze, a large-scale, high-quality
real-world dataset. LMHaze comprises paired hazy and haze-free images captured
in diverse indoor and outdoor environments, spanning multiple scenarios and
haze intensities. It contains over 5K high-resolution image pairs, surpassing
the size of the biggest existing real-world dehazing dataset by over 25 times.
Meanwhile, to better handle images with different haze intensities, we propose
a mixture-of-experts model based on Mamba (MoE-Mamba) for dehazing, which
dynamically adjusts the model parameters according to the haze intensity.
Moreover, with our proposed dataset, we conduct a new large multimodal model
(LMM)-based benchmark study to simulate human perception for evaluating dehazed
images. Experiments demonstrate that LMHaze dataset improves the dehazing
performance in real scenarios and our dehazing method provides better results
compared to state-of-the-art methods.
☆ Final Report for CHESS: Cloud, High-Performance Computing, and Edge for Science and Security
Nathan Tallent, Jan Strube, Luanzheng Guo, Hyungro Lee, Jesun Firoz, Sayan Ghosh, Bo Fang, Oceane Bel, Steven Spurgeon, Sarah Akers, Christina Doty, Erol Cromwell
Automating the theory-experiment cycle requires effective distributed
workflows that utilize a computing continuum spanning lab instruments, edge
sensors, computing resources at multiple facilities, data sets distributed
across multiple information sources, and potentially the cloud. Unfortunately, the
obvious methods for constructing continuum platforms, orchestrating workflow
tasks, and curating datasets over time fail to achieve scientific requirements
for performance, energy, security, and reliability. Furthermore, achieving the
best use of continuum resources depends upon the efficient composition and
execution of workflow tasks, i.e., combinations of numerical solvers, data
analytics, and machine learning. Pacific Northwest National Laboratory's LDRD
"Cloud, High-Performance Computing (HPC), and Edge for Science and Security"
(CHESS) has developed a set of interrelated capabilities for enabling
distributed scientific workflows and curating datasets. This report describes
the results and successes of CHESS from the perspective of open science.
☆ Integrated Image-Text Based on Semi-supervised Learning for Small Sample Instance Segmentation
Small sample instance segmentation is a very challenging task, and many
existing methods follow the meta-learning training strategy of pre-training
models on a support set and fine-tuning on a query set. The pre-training phase,
which is highly task-related, requires a significant amount of additional
training time and the selection of closely related datasets to ensure
effectiveness. This article proposes a novel small sample instance segmentation
solution from the perspective of maximizing the utilization of existing
information without increasing the annotation burden or training costs. The
proposed method designs two modules to address the problems encountered in
small sample instance segmentation. First, it helps the model fully utilize
unlabeled data by learning to generate pseudo labels, increasing the number of
available samples. Second, by integrating text and image features, more
accurate classification results can be obtained. These two modules are suitable
for both box-free and box-dependent frameworks. In this way, the proposed
method not only improves the performance of small sample instance segmentation
but also greatly reduces reliance on pre-training. We have conducted
experiments on three datasets from different scenes: on land, underwater, and
under a microscope. As evidenced by our experiments, integrating image and text
corrects classification confidence, and pseudo labels help the model obtain
more precise masks. All results demonstrate the effectiveness and superiority
of our method.
☆ Label Filling via Mixed Supervision for Medical Image Segmentation from Noisy Annotations
The success of medical image segmentation usually requires a large number of
high-quality labels. However, since the labeling process is usually affected by the
raters' varying skill levels and characteristics, the estimated masks provided
by different raters usually suffer from high inter-rater variability. In this
paper, we propose a simple yet effective Label Filling framework, termed as
LF-Net, predicting the groundtruth segmentation label given only noisy
annotations during training. The fundamental idea of label filling is to
supervise the segmentation model by a subset of pixels with trustworthy labels,
meanwhile filling labels of other pixels by mixed supervision. More concretely,
we propose a qualified majority voting strategy, i.e., a threshold voting
scheme is designed to model agreement among raters and the majority-voted
labels of the selected subset of pixels are regarded as supervision. To fill
labels of other pixels, two types of mixed auxiliary supervision are proposed:
a soft label learned from intrinsic structures of noisy annotations, and
raters' characteristics labels which propagate individual rater's
characteristics information. LF-Net has two main advantages. 1) Training with
trustworthy pixels incorporates training with confident supervision, guiding
the direction of groundtruth label learning. 2) Two types of mixed supervision
prevent over-fitting issues when the network is supervised by a subset of
pixels, and guarantee high fidelity with the true label. Results on five
datasets of diverse imaging modalities show that our LF-Net boosts segmentation
accuracy in all datasets compared with state-of-the-art methods, with even a 7%
improvement in DSC for MS lesion segmentation.
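The qualified majority voting strategy can be sketched as follows (a minimal
illustration with placeholder annotations, not the LF-Net code): a pixel is
trusted only when rater agreement exceeds a threshold, and only trusted pixels
receive the majority-voted label as direct supervision:

    import numpy as np

    def qualified_majority_vote(annotations, threshold=0.75):
        # annotations: (num_raters, H, W) binary masks from different raters
        vote_fraction = annotations.mean(axis=0)
        majority = (vote_fraction >= 0.5).astype(np.uint8)
        agreement = np.maximum(vote_fraction, 1.0 - vote_fraction)
        trusted = agreement >= threshold    # pixels with strong rater consensus
        return majority, trusted

    raters = (np.random.rand(5, 8, 8) > 0.5).astype(np.uint8)
    label, trusted_mask = qualified_majority_vote(raters)
    print(label.shape, trusted_mask.mean())  # fraction of directly supervised pixels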
☆ Benchmarking Pathology Foundation Models: Adaptation Strategies and Scenarios
In computational pathology, several foundation models have recently emerged
and demonstrated enhanced learning capability for analyzing pathology images.
However, adapting these models to various downstream tasks remains challenging,
particularly when faced with datasets from different sources and acquisition
conditions, as well as limited data availability. In this study, we benchmark
four pathology-specific foundation models across 14 datasets and two
scenarios, consistency assessment and flexibility assessment, addressing diverse
adaptation scenarios and downstream tasks. In the consistency assessment
scenario, involving five fine-tuning methods, we found that the
parameter-efficient fine-tuning approach was both efficient and effective for
adapting pathology-specific foundation models to diverse datasets within the
same downstream task. In the flexibility assessment scenario under data-limited
environments, utilizing five few-shot learning methods, we observed that the
foundation models benefited more from the few-shot learning methods that
involve modification during the testing phase only. These findings provide
insights that could guide the deployment of pathology-specific foundation
models in real clinical settings, potentially improving the accuracy and
reliability of pathology image analysis. The code for this study is available
at: https://github.com/QuIIL/BenchmarkingPathologyFoundationModels.
☆ Improving the Multi-label Atomic Activity Recognition by Robust Visual Feature and Advanced Attention @ ROAD++ Atomic Activity Recognition 2024
ROAD++ Track 3 proposes a multi-label atomic activity recognition task in
traffic scenarios, which can be standardized as a 64-class multi-label video
action recognition task. In this task, the robustness of visual feature
extraction remains a key challenge, as it directly affects model performance
and generalization ability. To cope with these issues, our team optimized three
aspects: data processing, the model, and post-processing. First, an appropriate
resolution and video sampling strategy are selected, and a fixed sampling
strategy is set for the validation and test sets. Second, for model training,
the team selects a variety of visual backbone networks for feature extraction
and then introduces the action-slot model, which is trained on the training and
validation sets and used for inference on the test set. Finally, for
post-processing, the team combines the strengths and weaknesses of different
models through weighted fusion; the final mAP on the test set was 58%, which is
4% higher than the challenge baseline.
☆ Few-shot target-driven instance detection based on open-vocabulary object detection models
Current large open vision models could be useful for one- and few-shot object
recognition. Nevertheless, gradient-based re-training solutions are costly. On
the other hand, open-vocabulary object detection models bring visual and
textual concepts closer in the same latent space, allowing zero-shot detection
via prompting at small computational cost. We propose a lightweight method to
turn the latter into one-shot or few-shot object recognition models without
requiring textual descriptions. Our experiments on the TEgO dataset using the
YOLO-World model as a base show that performance increases with the model size,
the number of examples and the use of image augmentation.
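The general recipe, sketched with random stand-in features (not the authors'
implementation): class embeddings derived from a few exemplar images replace
the text-prompt embeddings, and detected regions are scored by cosine
similarity against these prototypes:

    import numpy as np

    def build_class_prototypes(exemplar_embeddings):
        # Average and L2-normalize the few-shot exemplar embeddings per class.
        protos = {c: e.mean(axis=0) for c, e in exemplar_embeddings.items()}
        return {c: p / np.linalg.norm(p) for c, p in protos.items()}

    def classify_region(region_embedding, prototypes):
        r = region_embedding / np.linalg.norm(region_embedding)
        scores = {c: float(r @ p) for c, p in prototypes.items()}
        return max(scores, key=scores.get), scores

    rng = np.random.default_rng(0)
    exemplars = {"mug": rng.normal(size=(3, 512)), "keys": rng.normal(size=(5, 512))}
    prototypes = build_class_prototypes(exemplars)
    print(classify_region(rng.normal(size=512), prototypes))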
☆ START: A Generalized State Space Model with Saliency-Driven Token-Aware Transformation NeurIPS2024
Domain Generalization (DG) aims to enable models to generalize to unseen
target domains by learning from multiple source domains. Existing DG methods
primarily rely on convolutional neural networks (CNNs), which inherently learn
texture biases due to their limited receptive fields, making them prone to
overfitting source domains. While some works have introduced transformer-based
methods (ViTs) for DG to leverage the global receptive field, these methods
incur high computational costs due to the quadratic complexity of
self-attention. Recently, advanced state space models (SSMs), represented by
Mamba, have shown promising results in supervised learning tasks by achieving
linear complexity in sequence length during training and fast RNN-like
computation during inference. Inspired by this, we investigate the
generalization ability of the Mamba model under domain shifts and find that
input-dependent matrices within SSMs could accumulate and amplify
domain-specific features, thus hindering model generalization. To address this
issue, we propose a novel SSM-based architecture with saliency-based
token-aware transformation (namely START), which achieves state-of-the-art
(SOTA) performance and offers a competitive alternative to CNNs and ViTs. Our
START can selectively perturb and suppress domain-specific features in salient
tokens within the input-dependent matrices of SSMs, thus effectively reducing
the discrepancy between different domains. Extensive experiments on five
benchmarks demonstrate that START outperforms existing SOTA DG methods with
efficient linear complexity. Our code is available at
https://github.com/lingeringlight/START.
comment: Accepted by NeurIPS2024. The code is available at
https://github.com/lingeringlight/START
☆ Multispectral Texture Synthesis using RGB Convolutional Neural Networks
State-of-the-art RGB texture synthesis algorithms rely on style distances
that are computed through statistics of deep features. These deep features are
extracted by classification neural networks that have been trained on large
datasets of RGB images. Extending such synthesis methods to multispectral
images is not straightforward, since the pre-trained networks are designed for
and have been trained on RGB images. In this work, we propose two solutions to
extend these methods to multispectral imaging. Neither of them requires
additional training of the neural network from which the second-order neural
statistics are extracted. The first one consists in optimizing over batches of
random triplets of spectral bands throughout training. The second one projects
multispectral pixels onto a 3-dimensional space. We further explore the benefit
of a color transfer operation upstream of the projection to avoid the
potentially abnormal color distributions induced by the projection. Our
experiments compare the performances of the various methods through different
metrics. We demonstrate that they can be used to perform exemplar-based texture
synthesis, achieve good visual quality, and come close to state-of-the-art
methods on RGB bands.
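The first strategy can be sketched as follows (illustrative only; the Gram
statistics here are computed on raw bands rather than on features from a
pre-trained RGB network): at each step a random triplet of spectral bands is
treated as an RGB image so that RGB-style second-order statistics apply:

    import numpy as np

    def gram_matrix(features):
        c = features.reshape(features.shape[0], -1)
        return c @ c.T / c.shape[1]

    def style_distance_on_random_triplet(ms_image, ms_exemplar, rng):
        # ms_image, ms_exemplar: (bands, H, W) multispectral images
        triplet = rng.choice(ms_image.shape[0], size=3, replace=False)
        return np.abs(gram_matrix(ms_image[triplet]) -
                      gram_matrix(ms_exemplar[triplet])).mean()

    rng = np.random.default_rng(0)
    synth = np.random.rand(8, 64, 64)       # 8-band synthesized texture
    exemplar = np.random.rand(8, 64, 64)    # 8-band exemplar texture
    print(style_distance_on_random_triplet(synth, exemplar, rng))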
☆ Massimo: Public Queue Monitoring and Management using Mass-Spring Model
An efficient system for queue control and regulation in public spaces is very
important in order to avoid congestion and improve customer satisfaction. This
article offers a detailed roadmap, based on a merger of intelligent systems,
for creating efficient queue management in public places. Through the use of
different technologies, i.e., computer vision, machine learning algorithms, and
deep learning, our system provides accurate information about whether a place
is crowded and the measures that need to be taken.
comment: 8 pages, 6 figures, 3 algorithms, 3 tables
☆ 3D-GANTex: 3D Face Reconstruction with StyleGAN3-based Multi-View Images and 3DDFA based Mesh Generation
Geometry and texture estimation from a single face image is an ill-posed
problem since there is very little information to work with. The problem
further escalates when the face is rotated at a different angle. This paper
tackles this problem by introducing a novel method for texture estimation from
a single image using StyleGAN and 3D Morphable Models. The method begins by
generating multi-view faces using the latent space of the GAN. Then, 3DDFA
trained on a 3DMM estimates a 3D face mesh as well as a high-resolution texture
map that is consistent with the estimated face shape. The results show that the
generated mesh is of high quality with nearly accurate texture representation.
comment: 7 pages, 4 figures, 2 tables, pre-print version
☆ Visual Representation Learning Guided By Multi-modal Prior Knowledge
Hongkuan Zhou, Lavdim Halilaj, Sebastian Monka, Stefan Schmid, Yuqicheng Zhu, Bo Xiong, Steffen Staab
Despite the remarkable success of deep neural networks (DNNs) in computer
vision, they fail to remain high-performing when facing distribution shifts
between training and testing data. In this paper, we propose Knowledge-Guided
Visual representation learning (KGV), a distribution-based learning approach
leveraging multi-modal prior knowledge, to improve generalization under
distribution shift. We use prior knowledge from two distinct modalities: 1) a
knowledge graph (KG) with hierarchical and association relationships; and 2)
generated synthetic images of visual elements semantically represented in the
KG. The respective embeddings are generated from the given modalities in a
common latent space, i.e., visual embeddings from original and synthetic images
as well as knowledge graph embeddings (KGEs). These embeddings are aligned via
a novel variant of translation-based KGE methods, where the node and relation
embeddings of the KG are modeled as Gaussian distributions and translations
respectively. We claim that incorporating multi-model prior knowledge enables
more regularized learning of image representations. Thus, the models are able
to better generalize across different data distributions. We evaluate KGV on
different image classification tasks with major or minor distribution shifts,
namely road sign classification across datasets from Germany, China, and
Russia, image classification with the mini-ImageNet dataset and its variants,
as well as the DVM-CAR dataset. The results demonstrate that KGV consistently
exhibits higher accuracy and data efficiency than the baselines across all
experiments.
☆ Granularity Matters in Long-Tail Learning
Balancing training on long-tail data distributions remains a long-standing
challenge in deep learning. While methods such as re-weighting and re-sampling
help alleviate the imbalance issue, limited sample diversity continues to
hinder models from learning robust and generalizable feature representations,
particularly for tail classes. In contrast to existing methods, we offer a
novel perspective on long-tail learning, inspired by an observation: datasets
with finer granularity tend to be less affected by data imbalance. In this
paper, we investigate this phenomenon through both quantitative and qualitative
studies, showing that increased granularity enhances the generalization of
learned features in tail categories. Motivated by these findings, we propose a
method to increase dataset granularity through category extrapolation.
Specifically, we introduce open-set auxiliary classes that are visually similar
to existing ones, aiming to enhance representation learning for both head and
tail classes. This forms the core contribution and insight of our approach. To
automate the curation of auxiliary data, we leverage large language models
(LLMs) as knowledge bases to search for auxiliary categories and retrieve
relevant images through web crawling. To prevent the overwhelming presence of
auxiliary classes from disrupting training, we introduce a neighbor-silencing
loss that encourages the model to focus on class discrimination within the
target dataset. During inference, the classifier weights for auxiliary
categories are masked out, leaving only the target class weights for use.
Extensive experiments and ablation studies on three standard long-tail
benchmarks demonstrate the effectiveness of our approach, notably outperforming
strong baseline methods that use the same amount of data. The code will be made
publicly available.
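As a rough illustration of the inference-time masking described above, the
sketch below assumes a single linear head trained jointly over target and
auxiliary classes, with the auxiliary logits simply dropped at test time. Names,
dimensions, and the omission of the neighbor-silencing term are simplifying
assumptions, not the authors' implementation.
```python
import torch
import torch.nn as nn

# Hypothetical sketch: a linear classifier over target classes plus LLM-curated
# auxiliary (open-set) classes; at inference the auxiliary columns are masked out.
class AugmentedClassifier(nn.Module):
    def __init__(self, feat_dim: int, num_target: int, num_aux: int):
        super().__init__()
        self.num_target = num_target
        self.head = nn.Linear(feat_dim, num_target + num_aux)

    def forward(self, features: torch.Tensor, training: bool = True) -> torch.Tensor:
        logits = self.head(features)
        if not training:
            # Keep only the target-class logits at inference time.
            logits = logits[:, : self.num_target]
        return logits

feats = torch.randn(4, 512)                # toy features from a backbone
clf = AugmentedClassifier(512, num_target=100, num_aux=40)
train_logits = clf(feats, training=True)   # shape (4, 140) during training
test_logits = clf(feats, training=False)   # shape (4, 100) at inference
```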
☆ Zero-Shot Scene Reconstruction from Single Images with Deep Prior Assembly NeurIPS 2024
Large language and vision models have been leading a revolution in visual
computing. By greatly scaling up sizes of data and model parameters, the large
models learn deep priors which lead to remarkable performance in various tasks.
In this work, we present deep prior assembly, a novel framework that assembles
diverse deep priors from large models for scene reconstruction from single
images in a zero-shot manner. We show that this challenging task can be
accomplished without extra knowledge, simply by generalizing one deep prior to
one sub-task. To this end, we introduce novel methods for pose, scale, and
occlusion parsing, which are key to enabling deep priors to work together in a
robust way. Deep prior assembly does not require any 3D or 2D data-driven
training in the task and demonstrates superior performance in generalizing
priors to open-world scenes. We conduct evaluations on various datasets, and
report analysis, numerical and visual comparisons with the latest methods to
show our superiority. Project page:
https://junshengzhou.github.io/DeepPriorAssembly.
comment: To appear at NeurIPS 2024. Project page:
https://junshengzhou.github.io/DeepPriorAssembly
☆ A Paradigm Shift in Mouza Map Vectorization: A Human-Machine Collaboration Approach
Mahir Shahriar Dhrubo, Samira Akter, Anwarul Bashir Shuaib, Md Toki Tahmid, Zahid Hasan, A. B. M. Alim Al Islam
Efficient vectorization of hand-drawn cadastral maps, such as Mouza maps in
Bangladesh, poses a significant challenge due to their complex structures.
Current manual digitization methods are time-consuming and labor-intensive. Our
study proposes a semi-automated approach to streamline the digitization
process, saving both time and human resources. Our methodology focuses on
separating the plot boundaries and plot identifiers and applying our
digitization methodology to convert both of them into vectorized format. To
accomplish full vectorization, Convolutional Neural Network (CNN) models are
utilized for pre-processing and plot number detection along with our smoothing
algorithms based on the diversity of vector maps. The CNN models are trained
with our own labeled dataset, generated from the maps, and smoothing algorithms
are introduced from the various observations of the map's vector formats.
Further human intervention remains essential for precision. We have evaluated
our methods on several maps and provided both quantitative and qualitative
results with a user study. The results demonstrate that our methodology
outperforms the existing map digitization processes significantly.
comment: 13 pages including reference, 14 figures, 4 tables
☆ Diffusion Transformer Policy
Zhi Hou, Tianyi Zhang, Yuwen Xiong, Hengjun Pu, Chengyang Zhao, Ronglei Tong, Yu Qiao, Jifeng Dai, Yuntao Chen
Recent large visual-language action models pretrained on diverse robot
datasets have demonstrated the potential to generalize to new environments
with a small amount of in-domain data. However, those approaches usually predict
discretized or continuous actions with a small action head, which limits their
ability to handle diverse action spaces. In contrast, we model continuous
actions with a large multi-modal diffusion transformer, dubbed Diffusion
Transformer Policy, in which we directly denoise action chunks with a large
transformer model rather than a small action head. By leveraging the scaling
capability of transformers, the proposed approach can effectively model
continuous end-effector actions across large diverse robot datasets, and
achieve better generalization performance. Extensive experiments demonstrate
Diffusion Transformer Policy pretrained on diverse robot data can generalize to
different embodiments, including simulation environments like Maniskill2 and
Calvin, as well as the real-world Franka arm. Specifically, without bells and
whistles, the proposed approach achieves state-of-the-art performance with only
a single third-view camera stream in the Calvin novel task setting (ABC->D),
improving the average number of tasks completed in a row (out of 5) to 3.6, and
the pretraining stage significantly increases the success sequence length on
Calvin by over 1.2. The code will be publicly available.
comment: Preprint
☆ CamI2V: Camera-Controlled Image-to-Video Diffusion Model
Recently, camera pose, as a user-friendly and physics-related condition, has
been introduced into text-to-video diffusion models for camera control. However,
existing methods simply inject camera conditions through a side input. These
approaches neglect the inherent physical knowledge of camera pose, resulting in
imprecise camera control, inconsistencies, and also poor interpretability. In
this paper, we emphasize the necessity of integrating explicit physical
constraints into model design. Epipolar attention is proposed for modeling all
cross-frame relationships from a novel perspective of noised condition. This
ensures that features are aggregated from corresponding epipolar lines in all
noised frames, overcoming the limitations of current attention mechanisms in
tracking displaced features across frames, especially when features move
significantly with the camera and become obscured by noise. Additionally, we
introduce register tokens to handle cases without intersections between frames,
commonly caused by rapid camera movements, dynamic objects, or occlusions. To
support image-to-video, we propose the multiple guidance scale to allow for
precise control for image, text, and camera, respectively. Furthermore, we
establish a more robust and reproducible evaluation pipeline to solve the
inaccuracy and instability of existing camera control measurements. We achieve
a 25.5% improvement in camera controllability on RealEstate10K while maintaining
strong generalization to out-of-domain images. Only 24GB and 12GB are required
for training and inference, respectively. We plan to release checkpoints, along
with training and evaluation codes. Dynamic videos are best viewed at
https://zgctroy.github.io/CamI2V.
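The epipolar attention idea lends itself to a small geometric sketch: from a
relative camera pose and intrinsics one can build a fundamental matrix and keep
attention only between token pairs that lie near each other's epipolar lines.
The code below is a hedged approximation of that mechanism; the token
coordinates, threshold, and how the mask is consumed by attention are
illustrative assumptions, not the released model's code.
```python
import torch

def fundamental_matrix(K: torch.Tensor, R: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
    """F = K^-T [t]_x R K^-1 for the relative pose (R, t) mapping source to target camera."""
    t0, t1, t2 = t.tolist()
    tx = torch.tensor([[0.0, -t2, t1],
                       [t2, 0.0, -t0],
                       [-t1, t0, 0.0]])
    K_inv = torch.inverse(K)
    return K_inv.T @ tx @ R @ K_inv

def epipolar_attention_mask(pts_src: torch.Tensor, pts_tgt: torch.Tensor,
                            F: torch.Tensor, thresh: float = 2.0) -> torch.Tensor:
    """Boolean (N_src, N_tgt) mask, True where a target-frame token centre lies within
    `thresh` pixels of the epipolar line induced by a source-frame token.
    pts_src / pts_tgt are (N, 2) pixel coordinates of token centres."""
    src_h = torch.cat([pts_src, torch.ones(pts_src.shape[0], 1)], dim=1)  # (N_src, 3)
    tgt_h = torch.cat([pts_tgt, torch.ones(pts_tgt.shape[0], 1)], dim=1)  # (N_tgt, 3)
    lines = src_h @ F.T                       # one epipolar line l' = F x per source token
    dist = (lines @ tgt_h.T).abs() / lines[:, :2].norm(dim=1, keepdim=True).clamp_min(1e-8)
    return dist < thresh
```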
☆ AI-Driven Approaches for Glaucoma Detection -- A Comprehensive Review
The diagnosis of glaucoma plays a critical role in the management and
treatment of this vision-threatening disease. Glaucoma is a group of eye
diseases that cause blindness by damaging the optic nerve at the back of the
eye. Often called the "silent thief of sight", it exhibits no symptoms during the
early stages. Therefore, early detection is crucial to prevent vision loss.
With the rise of Artificial Intelligence (AI), particularly Deep Learning (DL)
techniques, Computer-Aided Diagnosis (CADx) systems have emerged as promising
tools to assist clinicians in accurately diagnosing glaucoma early. This paper
aims to provide a comprehensive overview of AI techniques utilized in CADx
systems for glaucoma diagnosis. Through a detailed analysis of current
literature, we identify key gaps and challenges in these systems, emphasizing
the need for improved safety, reliability, interpretability, and
explainability. By identifying research gaps, we aim to advance the field of
CADx systems especially for the early diagnosis of glaucoma, in order to
prevent any potential loss of vision.
☆ MBPU: A Plug-and-Play State Space Model for Point Cloud Upsampling with Fast Point Rendering
The task of point cloud upsampling (PCU) is to generate dense and uniform
point clouds from sparse input captured by 3D sensors like LiDAR. It holds
potential for real-world applications yet remains a challenging task. Existing deep
learning-based methods have shown significant achievements in this field.
However, they still face limitations in effectively handling long sequences and
addressing the issue of shrinkage artifacts around the surface of the point
cloud. Inspired by the newly proposed Mamba, in this paper, we introduce a
network named MBPU built on top of the Mamba architecture, which performs well
in long sequence modeling, especially for large-scale point cloud upsampling,
and achieves fast convergence speed. Moreover, MBPU is an arbitrary-scale
upsampling framework that acts as a predictor of point distance in the point
refinement phase. We simultaneously predict the 3D position shift and the 1D
point-to-point distance as regression quantities to constrain the global
features while ensuring the accuracy of local details. We also introduce a fast
differentiable renderer to further enhance the fidelity of the upsampled point
cloud and reduce artifacts. It is noted that, by the merits of our fast point
rendering, MBPU yields high-quality upsampled point clouds by effectively
eliminating surface noise. Extensive experiments have demonstrated that our
MBPU outperforms other off-the-shelf methods in terms of point cloud
upsampling, especially for large-scale point clouds.
☆ Focus on BEV: Self-calibrated Cycle View Transformation for Monocular Birds-Eye-View Segmentation
Birds-Eye-View (BEV) segmentation aims to establish a spatial mapping from
the perspective view to the top view and estimate the semantic maps from
monocular images. Recent studies have encountered difficulties in view
transformation due to the disruption of BEV-agnostic features in image space.
To tackle this issue, we propose a novel FocusBEV framework consisting of (i)
a self-calibrated cross-view transformation module to suppress the BEV-agnostic
image areas and focus on the BEV-relevant areas in the view transformation
stage, (ii) a plug-and-play ego-motion-based temporal fusion module to
exploit the spatiotemporal structure consistency in BEV space with a memory
bank, and (iii) an occupancy-agnostic IoU loss to mitigate both semantic and
positional uncertainties. Experimental evidence demonstrates that our approach
achieves a new state-of-the-art on two popular benchmarks, i.e., 29.2% mIoU on
nuScenes and 35.2% mIoU on Argoverse.
☆ GReFEL: Geometry-Aware Reliable Facial Expression Learning under Bias and Imbalanced Data Distribution ACCV 2024
Reliable facial expression learning (FEL) involves the effective learning of
distinctive facial expression characteristics for more reliable, unbiased and
accurate predictions in real-life settings. However, current systems struggle
with FEL tasks because of the variance in people's facial expressions due to
their unique facial structures, movements, tones, and demographics. Biased and
imbalanced datasets compound this challenge, leading to wrong and biased
prediction labels. To tackle these, we introduce GReFEL, leveraging Vision
Transformers and a facial geometry-aware anchor-based reliability balancing
module to combat imbalanced data distributions, bias, and uncertainty in facial
expression learning. Integrating local and global data with anchors that learn
different facial data points and structural features, our approach adjusts
biased and mislabeled emotions caused by intra-class disparity, inter-class
similarity, and scale sensitivity, resulting in comprehensive, accurate, and
reliable facial expression predictions. Our model outperforms current
state-of-the-art methodologies, as demonstrated by extensive experiments on
various datasets.
comment: ACCV 2024. Extended version of ARBEx (arXiv:2305.01486)
☆ Mitigating Object Hallucination via Concentric Causal Attention NeurIPS 2024
Recent Large Vision Language Models (LVLMs) present remarkable zero-shot
conversational and reasoning capabilities given multimodal queries.
Nevertheless, they suffer from object hallucination, a phenomenon where LVLMs
are prone to generate textual responses not factually aligned with image
inputs. Our pilot study reveals that object hallucination is closely tied with
Rotary Position Encoding (RoPE), a widely adopted positional dependency
modeling design in existing LVLMs. Due to the long-term decay in RoPE, LVLMs
tend to hallucinate more when relevant visual cues are distant from instruction
tokens in the multimodal input sequence. Additionally, we observe a similar
effect when reversing the sequential order of visual tokens during multimodal
alignment. Our tests indicate that long-term decay in RoPE poses challenges to
LVLMs while capturing visual-instruction interactions across long distances. We
propose Concentric Causal Attention (CCA), a simple yet effective positional
alignment strategy that mitigates the impact of RoPE long-term decay in LVLMs
by naturally reducing relative distance between visual and instruction tokens.
With CCA, visual tokens can better interact with instruction tokens, thereby
enhancing the model's perception capability and alleviating object hallucination.
Without bells and whistles, our positional alignment method surpasses existing
hallucination mitigation strategies by large margins on multiple object
hallucination benchmarks.
comment: To appear at NeurIPS 2024. Code is available at
https://github.com/xing0047/cca-llava
☆ Are Large-scale Soft Labels Necessary for Large-scale Dataset Distillation?
In ImageNet-condensation, the storage for auxiliary soft labels exceeds that
of the condensed dataset by over 30 times. However, are large-scale soft labels
necessary for large-scale dataset distillation? In this paper, we first
discover that the high within-class similarity in condensed datasets
necessitates the use of large-scale soft labels. This high within-class
similarity can be attributed to the fact that previous methods use samples from
different classes to construct a single batch for batch normalization (BN)
matching. To reduce the within-class similarity, we introduce class-wise
supervision during the image synthesizing process by batching the samples
within classes, instead of across classes. As a result, we can increase
within-class diversity and reduce the size of required soft labels. A key
benefit of improved image diversity is that soft label compression can be
achieved through simple random pruning, eliminating the need for complex
rule-based strategies. Experiments validate our discoveries. For example, when
condensing ImageNet-1K to 200 images per class, our approach compresses the
required soft labels from 113 GB to 2.8 GB (40x compression) with a 2.6%
performance gain. Code is available at:
https://github.com/he-y/soft-label-pruning-for-dataset-distillation
comment: Accepted by NeurIPS 2024
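The claim that "simple random pruning" suffices for soft-label compression can
be illustrated in a few lines; the record layout (one stored soft-label vector
per record) and the keep ratio below are assumptions for illustration only, not
the paper's storage format.
```python
import numpy as np

def randomly_prune_soft_labels(soft_labels: np.ndarray, keep_ratio: float, seed: int = 0):
    """Hypothetical sketch of simple random pruning: keep a random subset of the stored
    soft-label records and return them with their indices for lookup at training time."""
    rng = np.random.default_rng(seed)
    n = soft_labels.shape[0]
    keep = rng.choice(n, size=int(n * keep_ratio), replace=False)
    keep.sort()
    return soft_labels[keep], keep

# Toy example: 1000 stored 1000-way soft-label records, keep 2.5% of them.
labels = np.random.rand(1000, 1000).astype(np.float32)
pruned, kept_idx = randomly_prune_soft_labels(labels, keep_ratio=0.025)
print(pruned.shape)  # (25, 1000)
```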
☆ Leveraging CORAL-Correlation Consistency Network for Semi-Supervised Left Atrium MRI Segmentation
Semi-supervised learning (SSL) has been widely used to learn from both a few
labeled images and many unlabeled images to overcome the scarcity of labeled
samples in medical image segmentation. Most current SSL-based segmentation
methods use pixel values directly to identify similar features in labeled and
unlabeled data. They usually fail to accurately capture the intricate
attachment structures in the left atrium, such as areas of inconsistent
density or outward curvature, which add to the complexity of the task. In
this paper, we delve into this issue and introduce an effective solution,
CORAL(Correlation-Aligned)-Correlation Consistency Network (CORN), to capture
the global structure shape and local details of Left Atrium. Diverging from
previous methods focused on each local pixel value, the CORAL-Correlation
Consistency Module (CCM) in the CORN leverages second-order statistical
information to capture global structural features by minimizing the
distribution discrepancy between labeled and unlabeled samples in feature
space. Yet, direct construction of features from unlabeled data frequently
results in ``Sample Selection Bias'', leading to flawed supervision. We thus
further propose the Dynamic Feature Pool (DFP) for the CCM, which utilizes a
confidence-based filtering strategy to remove incorrectly selected features and
regularize both teacher and student models by constraining the similarity
matrix to be consistent. Extensive experiments on the Left Atrium dataset have
shown that the proposed CORN outperforms previous state-of-the-art
semi-supervised learning methods.
comment: 5 pages, 3 figures, Accepted by 2024 IEEE International Conference on
Bioinformatics and Biomedicine (BIBM 2024)
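Since the CCM aligns second-order statistics between labeled and unlabeled
features, a CORAL-style covariance-matching loss is a reasonable minimal sketch
of that ingredient; the actual module, its feature pooling, and the Dynamic
Feature Pool filtering are not reproduced here.
```python
import torch

def coral_loss(feat_labeled: torch.Tensor, feat_unlabeled: torch.Tensor) -> torch.Tensor:
    """CORAL-style loss: squared Frobenius distance between the feature covariance
    matrices of two batches, each of shape (N, D)."""
    def covariance(x: torch.Tensor) -> torch.Tensor:
        x = x - x.mean(dim=0, keepdim=True)
        return (x.T @ x) / (x.shape[0] - 1)

    d = feat_labeled.shape[1]
    diff = covariance(feat_labeled) - covariance(feat_unlabeled)
    return (diff ** 2).sum() / (4 * d * d)

# Usage sketch: features extracted from labeled and unlabeled volumes would be
# flattened to (N, D) before being passed to coral_loss.
loss = coral_loss(torch.randn(64, 128), torch.randn(64, 128))
```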
☆ Hybrid Architecture for Real-Time Video Anomaly Detection: Integrating Spatial and Temporal Analysis
We propose a new architecture for real-time anomaly detection in video data,
inspired by the way humans combine spatial and temporal analyses. This
approach uses two distinct models: for temporal analysis, a recurrent
convolutional network (CNN + RNN) is employed, combining VGG19 with a GRU to
process video sequences. Regarding spatial analysis, it is performed using
YOLOv7 to analyze individual images. These two analyses can be carried out
either in parallel, with a final prediction that combines the results of both
analyses, or in series, where the spatial analysis enriches the data before the
temporal analysis. In this article, we will compare these two architectural
configurations with each other, to evaluate the effectiveness of our hybrid
approach in video anomaly detection.
☆ Seismic Phase Picking
Seismic phase picking, which aims to determine the arrival time of P- and
S-waves according to seismic waveforms, is fundamental to earthquake
monitoring. Generally, manual phase picking is trustworthy, but with the
increasing number of worldwide stations and seismic monitors, it becomes more
challenging for humans to complete the task comprehensively. In this work, we
explore multiple ways to do automatic phase picking, including traditional and
learning-based methods.
☆ TexPro: Text-guided PBR Texturing with Procedural Material Modeling
In this paper, we present TexPro, a novel method for high-fidelity material
generation for input 3D meshes given text prompts. Unlike existing
text-conditioned texture generation methods that typically generate RGB
textures with baked lighting, TexPro is able to produce diverse texture maps
via procedural material modeling, which enables physically based rendering,
relighting, and additional benefits inherent to procedural materials.
Specifically, we first generate multi-view reference images given the input
textual prompt by employing the latest text-to-image model. We then derive
texture maps through a rendering-based optimization with recent differentiable
procedural materials. To this end, we design several techniques to handle the
misalignment between the generated multi-view images and 3D meshes, and
introduce a novel material agent that enhances material classification and
matching by exploring both part-level understanding and object-aware material
reasoning. Experiments demonstrate the superiority of the proposed method over
existing SOTAs and its capability of relighting.
comment: In submission. Supplementary material is included at the end of the
main paper (5 pages, 2 figures)
☆ Foundation Models for Slide-level Cancer Subtyping in Digital Pathology SC
Since the emergence of the ImageNet dataset, the pretraining and fine-tuning
approach has become widely adopted in computer vision due to the ability of
ImageNet-pretrained models to learn a wide variety of visual features. However,
a significant challenge arises when adapting these models to domain-specific
fields, such as digital pathology, due to substantial gaps between domains. To
address this limitation, foundation models (FM) have been trained on
large-scale in-domain datasets to learn the intricate features of
histopathology images. In cancer diagnosis, whole-slide image (WSI) prediction
is essential for patient prognosis, and multiple instance learning (MIL) has
been implemented to handle the giga-pixel size of WSI. As MIL frameworks rely
on patch-level feature aggregation, this work aims to compare the performance
of various feature extractors developed under different pretraining strategies
for cancer subtyping on WSI under a MIL framework. Results demonstrate the
ability of foundation models to surpass ImageNet-pretrained models for the
prediction of six skin cancer subtypes.
comment: Manuscript accepted for oral presentation at the Decision Science
Alliance - International Summer Conference (DSA-ISC) 2024, held in Valencia,
Spain
☆ Distributed Learning for UAV Swarms
Unmanned Aerial Vehicle (UAV) swarms are increasingly deployed in dynamic,
data-rich environments for applications such as environmental monitoring and
surveillance. These scenarios demand efficient data processing while
maintaining privacy and security, making Federated Learning (FL) a promising
solution. FL allows UAVs to collaboratively train global models without sharing
raw data, but challenges arise due to the non-Independent and Identically
Distributed (non-IID) nature of the data collected by UAVs. In this study, we
integrate state-of-the-art FL methods into a UAV swarm application and
investigate the performance of multiple aggregation methods (namely FedAvg,
FedProx, FedOpt, and MOON), with a particular focus on tackling non-IID data, on a
variety of datasets, specifically MNIST for baseline performance, CIFAR10 for
natural object classification, EuroSAT for environment monitoring, and CelebA
for surveillance. These algorithms were selected to cover improved techniques
on both client-side updates and global aggregation. Results show that while all
algorithms perform comparably on IID data, their performance deteriorates
significantly under non-IID conditions. FedProx demonstrated the most stable
overall performance, emphasising the importance of regularising local updates
in non-IID environments to mitigate drastic deviations in local models.
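Of the aggregation methods compared, FedProx is the one the study singles out
for stability under non-IID data; its key ingredient is a proximal term added
to each client's local loss. A minimal sketch of one local step follows; the
model, optimizer, and batch names are placeholders.
```python
import torch

def fedprox_local_step(model, global_params, batch, loss_fn, optimizer, mu=0.01):
    """One local update with the FedProx proximal term: task loss plus
    (mu/2) * ||w - w_global||^2, keeping client updates close to the global model."""
    inputs, targets = batch
    optimizer.zero_grad()
    loss = loss_fn(model(inputs), targets)
    prox = 0.0
    for p, g in zip(model.parameters(), global_params):
        prox = prox + ((p - g.detach()) ** 2).sum()
    loss = loss + 0.5 * mu * prox
    loss.backward()
    optimizer.step()
    return loss.item()

# global_params would be a snapshot of the server model taken at the start of the
# round, e.g. [p.clone() for p in server_model.parameters()].
```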
☆ MI-VisionShot: Few-shot adaptation of vision-language models for slide-level classification of histopathological images
Vision-language supervision has made remarkable strides in learning visual
representations from textual guidance. In digital pathology, vision-language
models (VLM), pre-trained on curated datasets of histological image-captions,
have been adapted to downstream tasks, such as region of interest
classification. Zero-shot transfer for slide-level prediction has been
formulated by MI-Zero, but it exhibits high variability depending on the
textual prompts. Inspired by prototypical learning, we propose MI-VisionShot, a
training-free adaptation method on top of VLMs to predict slide-level labels in
few-shot learning scenarios. Our framework takes advantage of the excellent
representation learning of VLM to create prototype-based classifiers under a
multiple-instance setting by retrieving the most discriminative patches within
each slide. Experimentation through different settings shows the ability of
MI-VisionShot to surpass zero-shot transfer with lower variability, even in
low-shot scenarios. Code coming soon at https://github.com/cvblab/MIVisionShot.
comment: Manuscript accepted for oral presentation at KES-InnovationInMedicine
2024, held in Madeira, Portugal
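A hedged sketch of the prototype-based, training-free idea: per slide, keep the
patch embeddings most similar to a class text embedding, average them into a
slide representation, build class prototypes from the few labeled support
slides, and classify query slides by nearest prototype. The function names,
top-k retrieval rule, and cosine scoring are illustrative assumptions.
```python
import torch
import torch.nn.functional as F

def slide_representation(patch_embs: torch.Tensor, text_emb: torch.Tensor, k: int = 16) -> torch.Tensor:
    """Keep the top-k patches of one slide (N, D) most similar to the class text
    embedding (D,) and average them into a normalized slide-level vector."""
    sims = F.normalize(patch_embs, dim=-1) @ F.normalize(text_emb, dim=0)
    topk = sims.topk(min(k, patch_embs.shape[0])).indices
    return F.normalize(patch_embs[topk].mean(dim=0), dim=0)

def nearest_prototype(query_repr: torch.Tensor, prototypes: torch.Tensor) -> int:
    """prototypes: (C, D) class prototypes, e.g. the mean slide representation of the
    few support slides per class; returns the predicted class index."""
    return int((prototypes @ query_repr).argmax())
```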
☆ Visual Motif Identification: Elaboration of a Curated Comparative Dataset and Classification Methods ECCV 2024
Adam Phillips, Daniel Grandes Rodriguez, Miriam Sánchez-Manzano, Alan Salvadó, Manuel Garin, Gloria Haro, Coloma Ballester
In cinema, visual motifs are recurrent iconographic compositions that carry
artistic or aesthetic significance. Their use throughout the history of visual
arts and media is interesting to researchers and filmmakers alike. Our goal in
this work is to recognise and classify these motifs by proposing a new machine
learning model that uses a custom dataset to that end. We show how features
extracted from a CLIP model can be leveraged by using a shallow network and an
appropriate loss to classify images into 20 different motifs, with surprisingly
good results: an F1-score of 0.91 on our test set. We also present several
ablation studies justifying the input features, architecture and
hyperparameters used.
comment: 17 pages, 11 figures, one table, to be published in the conference
proceedings of ECCV 2024
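A minimal sketch of the classification setup described: a shallow head over
frozen, precomputed CLIP image features for 20 motif classes. The feature
dimension, layer sizes, and the plain cross-entropy loss are assumptions
standing in for the paper's "appropriate loss".
```python
import torch
import torch.nn as nn

NUM_MOTIFS = 20

# Shallow classification head over frozen CLIP features (512-dim for ViT-B/32).
head = nn.Sequential(
    nn.Linear(512, 256),
    nn.ReLU(),
    nn.Dropout(0.2),
    nn.Linear(256, NUM_MOTIFS),
)

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.AdamW(head.parameters(), lr=1e-3)

clip_feats = torch.randn(32, 512)          # stand-in for precomputed CLIP embeddings
labels = torch.randint(0, NUM_MOTIFS, (32,))
loss = criterion(head(clip_feats), labels)
loss.backward()
optimizer.step()
```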
☆ R2I-rPPG: A Robust Region of Interest Selection Method for Remote Photoplethysmography to Extract Heart Rate
The COVID-19 pandemic has underscored the need for low-cost, scalable
approaches to measuring contactless vital signs, either during initial triage
at a healthcare facility or virtual telemedicine visits. Remote
photoplethysmography (rPPG) can accurately estimate heart rate (HR) when
applied to close-up videos of healthy volunteers in well-lit laboratory
settings. However, results from such highly optimized laboratory studies may
not be readily translated to healthcare settings. One significant barrier to
the practical application of rPPG in health care is the accurate localization
of the region of interest (ROI). Clinical or telemedicine visits may involve
sub-optimal lighting, movement artifacts, variable camera angle, and subject
distance. This paper presents an rPPG ROI selection method based on 3D facial
landmarks and patient head yaw angle. We then demonstrate the robustness of
this ROI selection method when coupled with the Plane-Orthogonal-to-Skin (POS)
rPPG method and applied to videos of patients presenting to an Emergency
Department for respiratory complaints. Our results demonstrate the
effectiveness of our proposed approach in improving the accuracy and robustness
of rPPG in a challenging clinical environment.
comment: preprint
☆ Random Token Fusion for Multi-View Medical Diagnosis NeurIPS 2024
In multi-view medical diagnosis, deep learning-based models often fuse
information from different imaging perspectives to improve diagnostic
performance. However, existing approaches are prone to overfitting and rely
heavily on view-specific features, which can lead to trivial solutions. In this
work, we introduce Random Token Fusion (RTF), a novel technique designed to
enhance multi-view medical image analysis using vision transformers. By
integrating randomness into the feature fusion process during training, RTF
addresses the issue of overfitting and enhances the robustness and accuracy of
diagnostic models without incurring any additional cost at inference. We
validate our approach on standard mammography and chest X-ray benchmark
datasets. Through extensive experiments, we demonstrate that RTF consistently
improves the performance of existing fusion methods, paving the way for a new
generation of multi-view medical foundation models.
comment: Originally published at the NeurIPS 2024 Workshop on Advancements In
Medical Foundation Models: Explainability, Robustness, Security, and Beyond
(AIM-FM)
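One plausible reading of "integrating randomness into the feature fusion
process" is sketched below: during training, each token position randomly takes
its token from one of the two views, while inference uses a deterministic
average. This is an illustrative guess at the mechanism, not the authors' exact
fusion scheme.
```python
import torch

def random_token_fusion(tokens_a: torch.Tensor, tokens_b: torch.Tensor,
                        training: bool, p: float = 0.5) -> torch.Tensor:
    """Fuse two views' token sequences of shape (B, N, D). During training each token
    position takes the token from view A with probability p, otherwise from view B;
    at inference the two views are averaged deterministically."""
    if training:
        mask = (torch.rand(tokens_a.shape[:2], device=tokens_a.device) < p).unsqueeze(-1)
        return torch.where(mask, tokens_a, tokens_b)
    return 0.5 * (tokens_a + tokens_b)
```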
☆ LiOn-XA: Unsupervised Domain Adaptation via LiDAR-Only Cross-Modal Adversarial Training IROS2024
In this paper, we propose LiOn-XA, an unsupervised domain adaptation (UDA)
approach that combines LiDAR-Only Cross-Modal (X) learning with Adversarial
training for 3D LiDAR point cloud semantic segmentation to bridge the domain
gap arising from environmental and sensor setup changes. Unlike existing works
that exploit multiple data modalities like point clouds and RGB image data, we
address UDA in scenarios where RGB images might not be available and show that
two distinct LiDAR data representations can learn from each other for UDA. More
specifically, we leverage 3D voxelized point clouds to preserve important
geometric structure in combination with 2D projection-based range images that
provide information such as object orientations or surfaces. To further align
the feature space between both domains, we apply adversarial training using
both features and predictions of both 2D and 3D neural networks. Our
experiments on 3 real-to-real adaptation scenarios demonstrate the
effectiveness of our approach, achieving new state-of-the-art performance when
compared to previous uni- and multi-model UDA methods. Our source code is
publicly available at https://github.com/JensLe97/lion-xa.
comment: Preprint, Paper has been accepted at IROS2024
☆ LiMTR: Time Series Motion Prediction for Diverse Road Users through Multimodal Feature Integration NeurIPS 2024
Camiel Oerlemans, Bram Grooten, Michiel Braat, Alaa Alassi, Emilia Silvas, Decebal Constantin Mocanu
Predicting the behavior of road users accurately is crucial to enable the
safe operation of autonomous vehicles in urban or densely populated areas.
Therefore, there has been a growing interest in time series motion prediction
research, leading to significant advancements in state-of-the-art techniques in
recent years. However, the potential of using LiDAR data to capture more
detailed local features, such as a person's gaze or posture, remains largely
unexplored. To address this, we develop a novel multimodal approach for motion
prediction based on the PointNet foundation model architecture, incorporating
local LiDAR features. Evaluation on the Waymo Open Dataset shows a performance
improvement of 6.20% and 1.58% in minADE and mAP respectively, when integrated
and compared with the previous state-of-the-art MTR. We open-source the code of
our LiMTR model.
comment: Accepted at the NeurIPS 2024 workshop Time Series in the Age of Large
Models. Code available at https://github.com/Cing2/LiMTR
☆ Kaninfradet3D: A Road-side Camera-LiDAR Fusion 3D Perception Model based on Nonlinear Feature Extraction and Intrinsic Correlation
With the development of AI-assisted driving, numerous methods have emerged
for ego-vehicle 3D perception tasks, but there has been limited research on
roadside perception. With its ability to provide a global view and a broader
sensing range, the roadside perspective is worth developing. LiDAR provides
precise three-dimensional spatial information, while cameras offer semantic
information. These two modalities are complementary in 3D detection. However,
adding camera data does not increase accuracy in some studies since the
information extraction and fusion procedure is not sufficiently reliable.
Recently, Kolmogorov-Arnold Networks (KANs) have been proposed as replacements
for MLPs, which are better suited for high-dimensional, complex data. Both the
camera and the LiDAR provide high-dimensional information, and employing KANs
should enhance the extraction of valuable features to produce better fusion
outcomes. This paper proposes Kaninfradet3D, which optimizes the feature
extraction and fusion modules. To extract features from complex
high-dimensional data, the model's encoder and fuser modules were improved
using KAN Layers. Cross-attention was applied to enhance feature fusion, and
visual comparisons verified that camera features were more evenly integrated.
This addressed the issue of camera features being abnormally concentrated,
negatively impacting fusion. Compared to the benchmark, our approach shows
improvements of +9.87 mAP and +10.64 mAP in the two viewpoints of the TUMTraf
Intersection Dataset and an improvement of +1.40 mAP in the roadside end of the
TUMTraf V2X Cooperative Perception Dataset. The results indicate that
Kaninfradet3D can effectively fuse features, demonstrating the potential of
applying KANs in roadside perception tasks.
☆ FusionLungNet: Multi-scale Fusion Convolution with Refinement Network for Lung CT Image Segmentation
Early detection of lung cancer is crucial as it increases the chances of
successful treatment. Automatic lung image segmentation assists doctors in
identifying diseases such as lung cancer, COVID-19, and respiratory disorders.
However, lung segmentation is challenging due to overlapping features like
vascular and bronchial structures, along with pixel-level fusion of brightness,
color, and texture. New lung segmentation methods face difficulties in
identifying long-range relationships between image components, rely on
convolution operations that may not capture all critical features, and must
contend with the complex structure of the lungs. Furthermore, semantic gaps between feature
maps can hinder the integration of relevant information, reducing model
accuracy. Skip connections can also limit the decoder's access to complete
information, resulting in partial information loss during encoding. To overcome
these challenges, we propose a hybrid approach using the FusionLungNet network,
which has a multi-level structure with key components, including the ResNet-50
encoder, Channel-wise Aggregation Attention (CAA) module, Multi-scale Feature
Fusion (MFF) block, self-refinement (SR) module, and multiple decoders. The
refinement sub-network uses convolutional neural networks for image
post-processing to improve quality. Our method employs a combination of loss
functions, including SSIM, IOU, and focal loss, to optimize image
reconstruction quality. We created and publicly released a new dataset for lung
segmentation called LungSegDB, including 1800 CT images from the LIDC-IDRI
dataset (dataset version 1) and 700 images from the Chest CT Cancer Images from
Kaggle dataset (dataset version 2). Our method achieved an IOU score of 98.04,
outperforming existing methods and demonstrating significant improvements in
segmentation accuracy. https://github.com/sadjadrz/FusionLungNet
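A hedged sketch of the combined objective the abstract lists, with focal and
soft-IoU terms written out; the SSIM term and the actual loss weights are
omitted and assumed, respectively.
```python
import torch
import torch.nn.functional as F

def focal_loss(logits: torch.Tensor, targets: torch.Tensor, gamma: float = 2.0) -> torch.Tensor:
    """Binary focal loss on raw logits; targets are 0/1 masks of the same shape."""
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = torch.exp(-bce)                      # probability assigned to the true class
    return ((1 - p_t) ** gamma * bce).mean()

def soft_iou_loss(logits: torch.Tensor, targets: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    probs = torch.sigmoid(logits)
    inter = (probs * targets).sum()
    union = (probs + targets - probs * targets).sum()
    return 1.0 - (inter + eps) / (union + eps)

def combined_loss(logits, targets, w_focal=1.0, w_iou=1.0):
    # An SSIM term would be added here as well; the weights are illustrative.
    return w_focal * focal_loss(logits, targets) + w_iou * soft_iou_loss(logits, targets)
```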
☆ Data-Efficient CLIP-Powered Dual-Branch Networks for Source-Free Unsupervised Domain Adaptation
Source-Free Unsupervised Domain Adaptation (SF-UDA) aims to transfer a
model's performance from a labeled source domain to an unlabeled target domain
without direct access to source samples, addressing data privacy issues.
However, most existing SF-UDA approaches assume the availability of abundant
source domain samples, which is often impractical due to the high cost of data
annotation. In this paper, we explore a more challenging scenario where direct
access to source domain samples is restricted, and the source domain contains
only a few samples. To tackle the dual challenges of limited source data and
privacy concerns, we introduce a data-efficient, CLIP-powered dual-branch
network (CDBN in short). We design a cross-modal dual-branch network that
integrates source domain class semantics into the unsupervised fine-tuning of
the target domain. It preserves the class information from the source domain
while enhancing the model's generalization to the target domain. Additionally,
we propose an unsupervised optimization strategy driven by accurate
classification and diversity, which aims to retain the classification
capability learned from the source domain while producing more confident and
diverse predictions in the target domain. Extensive experiments across 31
transfer tasks on 7 public datasets demonstrate that our approach achieves
state-of-the-art performance compared to existing methods.
☆ Assisted Physical Interaction: Autonomous Aerial Robots with Neural Network Detection, Navigation, and Safety Layers
Andrea Berra, Viswa Narayanan Sankaranarayanan, Achilleas Santi Seisa, Julien Mellet, Udayanga G. W. K. N. Gamage, Sumeet Gajanan Satpute, Fabio Ruggiero, Vincenzo Lippiello, Silvia Tolu, Matteo Fumagalli, George Nikolakopoulos, Miguel Ángel Trujillo Soto, Guillermo Heredia
The paper introduces a novel framework for safe and autonomous aerial
physical interaction in industrial settings. It comprises two main components:
a neural network-based target detection system enhanced with edge computing for
reduced onboard computational load, and a control barrier function (CBF)-based
controller for safe and precise maneuvering. The target detection system is
trained on a dataset under challenging visual conditions and evaluated for
accuracy across various unseen data with changing lighting conditions. Depth
features are utilized for target pose estimation, with the entire detection
framework offloaded into low-latency edge computing. The CBF-based controller
enables the UAV to converge safely to the target for precise contact. Simulated
evaluations of both the controller and target detection are presented,
alongside an analysis of real-world detection performance.
comment: 8 pages,14 figures, ICUAS 2024
☆ Habaek: High-performance water segmentation through dataset expansion and inductive bias optimization
Water segmentation is critical to disaster response and water resource
management. Authorities may employ high-resolution photography to monitor
rivers, lakes, and reservoirs, allowing for more proactive management in
agriculture, industry, and conservation. Deep learning has improved flood
monitoring by allowing models like CNNs, U-Nets, and transformers to handle
large volumes of satellite and aerial data. However, these models usually have
significant processing requirements, limiting their usage in real-time
applications. This research proposes upgrading the SegFormer model for water
segmentation by data augmentation with datasets such as ADE20K and RIWA to
boost generalization. We examine how inductive bias affects attention-based
models and discover that SegFormer performs better on bigger datasets. To
further demonstrate the function of data augmentation, Low-Rank Adaptation
(LoRA) is used to lower processing complexity while preserving accuracy. We
show that the suggested Habaek model outperforms current models in
segmentation, with an Intersection over Union (IoU) ranging from 0.91986 to
0.94397. In terms of F1-score, recall, accuracy, and precision, Habaek performs
better than rival models, indicating its potential for real-world applications.
This study highlights the need to enhance structures and include datasets for
effective water segmentation.
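Low-Rank Adaptation reduces the number of trainable parameters by adding a
small low-rank update to frozen projections. The sketch below shows the generic
mechanism around an nn.Linear; the rank, scaling, and where such adapters would
sit inside SegFormer are assumptions, not details from the paper.
```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wrap a frozen nn.Linear with a low-rank update: y = W x + (alpha / r) * B(A(x))."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # the pretrained weights stay frozen
        self.lora_a = nn.Linear(base.in_features, r, bias=False)
        self.lora_b = nn.Linear(r, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)   # start as an identity-preserving update
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))

layer = LoRALinear(nn.Linear(256, 256))
out = layer(torch.randn(4, 256))             # only lora_a / lora_b receive gradients
```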
☆ WildOcc: A Benchmark for Off-Road 3D Semantic Occupancy Prediction
3D semantic occupancy prediction is an essential part of autonomous driving,
focusing on capturing the geometric details of scenes. Off-road environments
are rich in geometric information, therefore it is suitable for 3D semantic
occupancy prediction tasks to reconstruct such scenes. However, most research
concentrates on on-road environments, and few methods are designed
for off-road 3D semantic occupancy prediction due to the lack of relevant
datasets and benchmarks. In response to this gap, we introduce WildOcc, to our
knowledge, the first benchmark to provide dense occupancy annotations for
off-road 3D semantic occupancy prediction tasks. A ground truth generation
pipeline is proposed in this paper, which employs a coarse-to-fine
reconstruction to achieve a more realistic result. Moreover, we introduce a
multi-modal 3D semantic occupancy prediction framework, which fuses
spatio-temporal information from multi-frame images and point clouds at voxel
level. In addition, a cross-modality distillation function is introduced, which
transfers geometric knowledge from point clouds to image features.
☆ An Efficient System for Automatic Map Storytelling -- A Case Study on Historical Maps
Historical maps provide valuable information and knowledge about the past.
However, as they often feature non-standard projections, hand-drawn styles, and
artistic elements, it is challenging for non-experts to identify and interpret
them. While existing image captioning methods have achieved remarkable success
on natural images, their performance on maps is suboptimal as maps are
underrepresented in their pre-training process. Despite the recent advance of
GPT-4 in text recognition and map captioning, it still has a limited
understanding of maps, as its performance wanes when texts (e.g., titles and
legends) in maps are missing or inaccurate. Besides, it is inefficient or even
impractical to fine-tune the model with users' own datasets. To address these
problems, we propose a novel and lightweight map-captioning counterpart.
Specifically, we fine-tune the state-of-the-art vision-language model CLIP to
generate captions relevant to historical maps and enrich the captions with
GPT-3.5 to tell a brief story regarding the where, what, when, and why of a given
map. We propose a novel decision tree architecture to only generate captions
relevant to the specified map type. Our system shows invariance to text
alterations in maps. The system can be easily adapted and extended to other map
types and scaled to a larger map captioning system. The code is open-sourced at
https://github.com/claudaff/automatic-map-storytelling.
☆ Reducing Hallucinations in Vision-Language Models via Latent Space Steering
Hallucination poses a challenge to the deployment of large vision-language
models (LVLMs) in applications. Unlike in large language models (LLMs),
hallucination in LVLMs often arises from misalignments between visual inputs
and textual outputs. This paper investigates the underlying mechanisms of
hallucination, focusing on the unique structure of LVLMs that distinguishes
them from large language models (LLMs). We identify that hallucinations often
arise from the sensitivity of text decoders to vision inputs, a natural
phenomenon when image encoders and text decoders are pre-trained separately.
Inspired by this, we introduce Visual and Textual Intervention (VTI), a novel
technique designed to reduce hallucinations by steering latent space
representations during inference to enhance the stability of vision features.
As a task-agnostic test-time intervention, VTI can be easily applied to any
problem without additional cost. Extensive experiments demonstrate that it can
effectively reduce hallucinations and outperform baseline methods across
multiple metrics, highlighting the critical role of vision feature stability in
LVLMs.
comment: 21 pages
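Test-time steering of latent representations can be sketched with a forward
hook that shifts a chosen layer's hidden states along a precomputed direction
during decoding. How that direction is obtained (the core of VTI) is not shown
here; the hook, scaling factor, and layer choice below are illustrative
assumptions.
```python
import torch

def add_steering_hook(layer: torch.nn.Module, direction: torch.Tensor, alpha: float = 4.0):
    """Register a forward hook that shifts the layer's hidden states along a fixed
    direction at inference time. `direction` has the layer's hidden dimension."""
    direction = direction / direction.norm()

    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + alpha * direction.to(hidden.dtype).to(hidden.device)
        return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

    return layer.register_forward_hook(hook)

# Usage sketch: handle = add_steering_hook(model.layers[k], precomputed_direction)
# ... run generation ...; handle.remove() afterwards.
```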
★ Generalizing Motion Planners with Mixture of Experts for Autonomous Driving
Qiao Sun, Huimin Wang, Jiahao Zhan, Fan Nie, Xin Wen, Leimeng Xu, Kun Zhan, Peng Jia, Xianpeng Lang, Hang Zhao
Large real-world driving datasets have sparked significant research into
various aspects of data-driven motion planners for autonomous driving. These
include data augmentation, model architecture, reward design, training
strategies, and planner pipelines. These planners promise better
generalizations on complicated and few-shot cases than previous methods.
However, experiment results show that many of these approaches produce limited
generalization abilities in planning performance due to overly complex designs
or training paradigms. In this paper, we review and benchmark previous methods
focusing on generalizations. The experimental results indicate that as models
are appropriately scaled, many design elements become redundant. We introduce
StateTransformer-2 (STR2), a scalable, decoder-only motion planner that uses a
Vision Transformer (ViT) encoder and a mixture-of-experts (MoE) causal
Transformer architecture. The MoE backbone addresses modality collapse and
reward balancing by expert routing during training. Extensive experiments on
the NuPlan dataset show that our method generalizes better than previous
approaches across different test sets and closed-loop simulations. Furthermore,
we assess its scalability on billions of real-world urban driving scenarios,
demonstrating consistent accuracy improvements as both data and model size
grow.
comment: 7 pages, 3 figures
☆ Learning to Synthesize Graphics Programs for Geometric Artworks ICPR 2024
Creating and understanding art has long been a hallmark of human ability.
When presented with finished digital artwork, professional graphic artists can
intuitively deconstruct and replicate it using various drawing tools, such as
the line tool, paint bucket, and layer features, including opacity and blending
modes. While most recent research in this field has focused on art generation,
proposing a range of methods, these often rely on the concept of artwork being
represented as a final image. To bridge the gap between pixel-level results and
the actual drawing process, we present an approach that treats a set of drawing
tools as executable programs. This method predicts a sequence of steps to
achieve the final image, allowing for understandable and resolution-independent
reproductions using a set of drawing commands. Our experiments
demonstrate that our program synthesizer, Art2Prog, can comprehensively
understand complex input images and reproduce them using high-quality
executable programs. The experimental results evidence the potential of
machines to grasp higher-level information from images and generate compact
program-level descriptions.
comment: ICPR 2024
☆ Improving Instance Optimization in Deformable Image Registration with Gradient Projection
Deformable image registration is inherently a multi-objective optimization
(MOO) problem, requiring a delicate balance between image similarity and
deformation regularity. These conflicting objectives often lead to poor
optimization outcomes, such as being trapped in unsatisfactory local minima or
experiencing slow convergence. Deep learning methods have recently gained
popularity in this domain due to their efficiency in processing large datasets
and achieving high accuracy. However, they often underperform during test time
compared to traditional optimization techniques, which further explore
iterative, instance-specific gradient-based optimization. This performance gap
is more pronounced when a distribution shift between training and test data
exists. To address this issue, we focus on the instance optimization (IO)
paradigm, which involves additional optimization for test-time instances based
on a pre-trained model. IO effectively combines the generalization capabilities
of deep learning with the fine-tuning advantages of instance-specific
optimization. Within this framework, we emphasize the use of gradient
projection to mitigate conflicting updates in MOO. This technique projects
conflicting gradients into a common space, better aligning the dual objectives
and enhancing optimization stability. We validate our method using a
state-of-the-art foundation model on the 3D Brain inter-subject registration
task (LUMIR) from the Learn2Reg 2024 Challenge. Our results show significant
improvements over standard gradient descent, leading to more accurate and
reliable registration results.
comment: L2R 2024 Challenge Paper
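A minimal sketch of projecting conflicting gradients into a common space, in
the spirit of PCGrad: when the similarity and regularity gradients point in
opposing directions, each is projected onto the normal plane of the other
before the update. Whether the paper uses exactly this rule is an assumption.
```python
import torch

def project_conflicting(grad_sim: torch.Tensor, grad_reg: torch.Tensor):
    """grad_sim / grad_reg are flattened gradients of the similarity and regularity
    objectives. If they conflict (negative dot product), project each onto the normal
    plane of the other; otherwise return them unchanged."""
    dot = torch.dot(grad_sim, grad_reg)
    if dot < 0:
        new_sim = grad_sim - dot / grad_reg.norm().pow(2).clamp_min(1e-12) * grad_reg
        new_reg = grad_reg - dot / grad_sim.norm().pow(2).clamp_min(1e-12) * grad_sim
        return new_sim, new_reg
    return grad_sim, grad_reg

# The instance-optimization step would then apply the sum of the (possibly projected)
# gradients to the deformation parameters, e.g. params.grad = new_sim + new_reg.
```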
☆ How Important are Data Augmentations to Close the Domain Gap for Object Detection in Orbit?
We investigate the efficacy of data augmentations to close the domain gap in
spaceborne computer vision, crucial for autonomous operations like on-orbit
servicing. As the use of computer vision in space increases, challenges such as
hostile illumination and low signal-to-noise ratios significantly hinder
performance. While learning-based algorithms show promising results, their
adoption is limited by the need for extensive annotated training data and the
domain gap that arises from differences between synthesized and real-world
imagery. This study explores domain generalization in terms of data
augmentations -- classical color and geometric transformations, corruptions,
and noise -- to enhance model performance across the domain gap. To this end,
we conduct a large-scale experiment using a hyperparameter optimization
pipeline that samples hundreds of different configurations and searches for the
best set to bridge the domain gap. As a reference task, we use 2D object
detection and evaluate on the SPEED+ dataset that contains real
hardware-in-the-loop satellite images in its test set. Moreover, we evaluate
four popular object detectors, including Mask R-CNN, Faster R-CNN, YOLO-v7, and
the open set detector GroundingDINO, and highlight their trade-offs between
performance, inference speed, and training time. Our results underscore the
vital role of data augmentations in bridging the domain gap, improving model
performance, robustness, and reliability for critical space applications. As a
result, we propose two novel data augmentations specifically developed to
emulate the visual effects observed in orbital imagery. We conclude by
recommending the most effective augmentations for advancing computer vision in
challenging orbital environments. Code for training detectors and
hyperparameter search will be made publicly available.
☆ DeepIcon: A Hierarchical Network for Layer-wise Icon Vectorization
In contrast to the well-established technique of rasterization, vectorization
of images poses a significant challenge in the field of computer graphics.
Recent learning-based methods for converting raster images to vector formats
frequently suffer from incomplete shapes, redundant path prediction, and a lack
of accuracy in preserving the semantics of the original content. These
shortcomings severely hinder the utility of these methods for further editing
and manipulation of images. To address these challenges, we present DeepIcon, a
novel hierarchical image vectorization network specifically tailored for
generating variable-length icon vector graphics based on the raster image
input. Our experimental results indicate that DeepIcon can efficiently produce
Scalable Vector Graphics (SVGs) directly from raster images, bypassing the need
for a differentiable rasterizer while also demonstrating a profound
understanding of the image contents.
comment: Accepted as Oral Presentation at DICTA 2024
☆ Unleashing the Potential of Vision-Language Pre-Training for 3D Zero-Shot Lesion Segmentation via Mask-Attribute Alignment
Recent advancements in medical vision-language pre-training models have
driven significant progress in zero-shot disease recognition. However,
transferring image-level knowledge to pixel-level tasks, such as lesion
segmentation in 3D CT scans, remains a critical challenge. Due to the
complexity and variability of pathological visual characteristics, existing
methods struggle to align fine-grained lesion features not encountered during
training with disease-related textual representations. In this paper, we
present Malenia, a novel multi-scale lesion-level mask-attribute alignment
framework, specifically designed for 3D zero-shot lesion segmentation. Malenia
improves the compatibility between mask representations and their associated
elemental attributes, explicitly linking the visual features of unseen lesions
with the extensible knowledge learned from previously seen ones. Furthermore,
we design a Cross-Modal Knowledge Injection module to enhance both visual and
textual features with mutually beneficial information, effectively guiding the
generation of segmentation results. Comprehensive experiments across three
datasets and 12 lesion categories validate the superior performance of Malenia.
Codes will be publicly available.
☆ ViMoE: An Empirical Study of Designing Vision Mixture-of-Experts
Xumeng Han, Longhui Wei, Zhiyang Dou, Zipeng Wang, Chenhui Qiang, Xin He, Yingfei Sun, Zhenjun Han, Qi Tian
Mixture-of-Experts (MoE) models embody the divide-and-conquer concept and are
a promising approach for increasing model capacity, demonstrating excellent
scalability across multiple domains. In this paper, we integrate the MoE
structure into the classic Vision Transformer (ViT), naming it ViMoE, and
explore the potential of applying MoE to vision through a comprehensive study
on image classification. However, we observe that the performance is sensitive
to the configuration of MoE layers, making it challenging to obtain optimal
results without careful design. The underlying cause is that inappropriate MoE
layers lead to unreliable routing and hinder experts from effectively acquiring
helpful knowledge. To address this, we introduce a shared expert to learn and
capture common information, serving as an effective way to construct stable
ViMoE. Furthermore, we demonstrate how to analyze expert routing behavior,
revealing which MoE layers are capable of specializing in handling specific
information and which are not. This provides guidance for retaining the
critical layers while removing redundancies, thereby advancing ViMoE to be more
efficient without sacrificing accuracy. We aspire for this work to offer new
insights into the design of vision MoE models and provide valuable empirical
guidance for future research.
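The stabilising role of a shared expert can be illustrated with a toy MoE
feed-forward block: one always-active expert captures common information while
a router sends each token to a single specialised expert. The dimensions, top-1
routing, and the absence of load-balancing losses are simplifications, not the
paper's exact design.
```python
import torch
import torch.nn as nn

class MoEWithSharedExpert(nn.Module):
    """Illustrative MoE feed-forward block: one always-on shared expert plus top-1
    routing over specialised experts."""
    def __init__(self, dim: int, hidden: int, num_experts: int = 4):
        super().__init__()
        def make_ffn():
            return nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
        self.shared = make_ffn()
        self.experts = nn.ModuleList(make_ffn() for _ in range(num_experts))
        self.router = nn.Linear(dim, num_experts)

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: (B, N, D)
        gates = self.router(x).softmax(dim=-1)             # (B, N, E)
        top_gate, top_idx = gates.max(dim=-1)               # top-1 routing
        routed = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = top_idx == e
            if mask.any():
                routed[mask] = expert(x[mask])
        return self.shared(x) + top_gate.unsqueeze(-1) * routed

block = MoEWithSharedExpert(dim=384, hidden=1536)
out = block(torch.randn(2, 196, 384))
```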
★ Object-Centric Temporal Consistency via Conditional Autoregressive Inductive Biases
Cristian Meo, Akihiro Nakano, Mircea Lică, Aniket Didolkar, Masahiro Suzuki, Anirudh Goyal, Mengmi Zhang, Justin Dauwels, Yutaka Matsuo, Yoshua Bengio
Unsupervised object-centric learning from videos is a promising approach
towards learning compositional representations that can be applied to various
downstream tasks, such as prediction and reasoning. Recently, it was shown that
pretrained Vision Transformers (ViTs) can be useful to learn object-centric
representations on real-world video datasets. However, while these approaches
succeed at extracting objects from the scenes, the slot-based representations
fail to maintain temporal consistency across consecutive frames in a video,
i.e. the mapping of objects to slots changes across the video. To address this,
we introduce Conditional Autoregressive Slot Attention (CA-SA), a framework
that enhances the temporal consistency of extracted object-centric
representations in video-centric vision tasks. Leveraging an autoregressive
prior network to condition representations on previous timesteps and a novel
consistency loss function, CA-SA predicts future slot representations and
imposes consistency across frames. We present qualitative and quantitative
results showing that our proposed method outperforms the considered baselines
on downstream tasks, such as video prediction and visual question-answering
tasks.
☆ Students Rather Than Experts: A New AI For Education Pipeline To Model More Human-Like And Personalised Early Adolescences
The capabilities of large language models (LLMs) have been applied in expert
systems across various domains, providing new opportunities for AI in
Education. Educational interactions involve a cyclical exchange between
teachers and students. Current research predominantly focuses on using LLMs to
simulate teachers, leveraging their expertise to enhance student learning
outcomes. However, the simulation of students, which could improve teachers'
instructional skills, has received insufficient attention due to the challenges
of modeling and evaluating virtual students. This research asks: Can LLMs be
utilized to develop virtual student agents that mimic human-like behavior and
individual variability? Unlike expert systems focusing on knowledge delivery,
virtual students must replicate learning difficulties, emotional responses, and
linguistic uncertainties. These traits present significant challenges in both
modeling and evaluation. To address these issues, this study focuses on
language learning as a context for modeling virtual student agents. We propose
a novel AI4Education framework, called SOE (Scene-Object-Evaluation), to
systematically construct LVSA (LLM-based Virtual Student Agents). By curating a
dataset of personalized teacher-student interactions with various personality
traits, question types, and learning stages, and fine-tuning LLMs using LoRA,
we conduct multi-dimensional evaluation experiments. Specifically, we: (1)
develop a theoretical framework for generating LVSA; (2) integrate human
subjective evaluation metrics into GPT-4 assessments, demonstrating a strong
correlation between human evaluators and GPT-4 in judging LVSA authenticity;
and (3) validate that LLMs can generate human-like, personalized virtual
student agents in educational contexts, laying a foundation for future
applications in pre-service teacher training and multi-agent simulation
environments.
☆ PALMS: Plane-based Accessible Indoor Localization Using Mobile Smartphones
In this paper, we present PALMS, an innovative indoor global localization and
relocalization system for mobile smartphones that utilizes publicly available
floor plans. Unlike most vision-based methods that require constant visual
input, our system adopts a dynamic form of localization that considers a single
instantaneous observation and odometry data. The core contribution of this work
is the introduction of a particle filter initialization method that leverages
the Certainly Empty Space (CES) constraint along with principal orientation
matching. This approach creates a spatial probability distribution of the
device's location, significantly improving localization accuracy and reducing
particle filter convergence time. Our experimental evaluations demonstrate that
PALMS outperforms traditional methods with uniformly initialized particle
filters, providing a more efficient and accessible approach to indoor
wayfinding. By eliminating the need for prior environmental fingerprinting,
PALMS provides a scalable and practical approach to indoor navigation.
comment: 7 pages, 3 figures, accepted to the 14th International Conference on
Indoor Positioning and Indoor Navigation (IPIN) 2024, Best Presentation Award
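A toy sketch of CES-style particle initialization is shown below, assuming a binary floor-plan occupancy grid; the CES mask, orientation prior, and weighting are simplified stand-ins rather than the PALMS algorithm itself.

```python
# Toy sketch of CES-weighted particle initialization on a floor-plan grid
# (the CES mask and orientation prior below are simplified stand-ins).
import numpy as np

rng = np.random.default_rng(0)

# Floor plan: 1 = wall/occupied, 0 = free. CES = cells we are sure are empty.
floor_plan = np.zeros((100, 100), dtype=int)
floor_plan[:, :5] = 1
floor_plan[40:60, 40:60] = 1
ces_mask = floor_plan == 0

def init_particles(n_particles, ces_mask, principal_orientations, obs_heading, sigma=0.3):
    """Sample particle poses; keep only those inside the CES and align headings
    with one of the plan's principal orientations (plus noise)."""
    h, w = ces_mask.shape
    xy = rng.uniform(0, [w, h], size=(n_particles, 2))
    inside = ces_mask[xy[:, 1].astype(int), xy[:, 0].astype(int)]
    xy = xy[inside]
    thetas = rng.choice(principal_orientations, size=len(xy)) + obs_heading
    thetas += rng.normal(0, sigma, size=len(xy))
    weights = np.full(len(xy), 1.0 / max(len(xy), 1))
    return np.column_stack([xy, thetas]), weights

particles, weights = init_particles(
    5000, ces_mask, principal_orientations=np.deg2rad([0, 90, 180, 270]), obs_heading=0.2)
print(particles.shape, weights.sum())
```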
☆ Enhancing SNN-based Spatio-Temporal Learning: A Benchmark Dataset and Cross-Modality Attention Model
Spiking Neural Networks (SNNs), renowned for their low power consumption,
brain-inspired architecture, and spatio-temporal representation capabilities,
have garnered considerable attention in recent years. Similar to Artificial
Neural Networks (ANNs), high-quality benchmark datasets are of great importance
to the advances of SNNs. However, our analysis indicates that many prevalent
neuromorphic datasets lack strong temporal correlation, preventing SNNs from
fully exploiting their spatio-temporal representation capabilities. Meanwhile,
the integration of event and frame modalities offers more comprehensive visual
spatio-temporal information. Yet, the SNN-based cross-modality fusion remains
underexplored.
In this work, we present a neuromorphic dataset called DVS-SLR that can
better exploit the inherent spatio-temporal properties of SNNs. Compared to
existing datasets, it offers advantages in terms of higher temporal
correlation, larger scale, and more varied scenarios. In addition, our
neuromorphic dataset contains corresponding frame data, which can be used for
developing SNN-based fusion methods. By virtue of the dual-modal feature of the
dataset, we propose a Cross-Modality Attention (CMA) based fusion method. The
CMA model efficiently utilizes the unique advantages of each modality, allowing
SNNs to learn both temporal and spatial attention scores from the
spatio-temporal features of event and frame modalities, subsequently allocating
these scores across modalities to enhance their synergy. Experimental results
demonstrate that our method not only improves recognition accuracy but also
ensures robustness across diverse scenarios.
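A simplified, non-spiking rendition of cross-modality attention between event and frame feature streams is sketched below; the actual CMA operates within an SNN, so the spiking dynamics and score allocation are omitted or approximated here.

```python
# Simplified cross-modality attention between event and frame features
# (a non-spiking stand-in; the SNN dynamics of the actual CMA are omitted).
import torch
import torch.nn as nn

class CrossModalityAttention(nn.Module):
    def __init__(self, dim: int = 128, heads: int = 4):
        super().__init__()
        self.event_to_frame = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.frame_to_event = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, event_feat, frame_feat):
        # event_feat, frame_feat: (batch, tokens, dim)
        f_att, _ = self.event_to_frame(frame_feat, event_feat, event_feat)  # frame queries events
        e_att, _ = self.frame_to_event(event_feat, frame_feat, frame_feat)  # events query frames
        return self.fuse(torch.cat([e_att + event_feat, f_att + frame_feat], dim=-1))

if __name__ == "__main__":
    cma = CrossModalityAttention()
    out = cma(torch.randn(2, 49, 128), torch.randn(2, 49, 128))
    print(out.shape)  # torch.Size([2, 49, 128])
```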
☆ RANSAC Back to SOTA: A Two-stage Consensus Filtering for Real-time 3D Registration
Correspondence-based point cloud registration (PCR) plays a key role in
robotics and computer vision. However, challenges like sensor noises, object
occlusions, and descriptor limitations inevitably result in numerous outliers.
RANSAC family is the most popular outlier removal solution. However, the
requisite iterations escalate exponentially with the outlier ratio, rendering
it far inferior to existing methods (SC2PCR [1], MAC [2], etc.) in terms of
accuracy or speed. Thus, we propose a two-stage consensus filtering (TCF) that
elevates RANSAC to state-of-the-art (SOTA) speed and accuracy. Firstly,
one-point RANSAC obtains a consensus set based on length consistency.
Subsequently, two-point RANSAC refines the set via angle consistency. Then,
three-point RANSAC computes a coarse pose and removes outliers based on
transformed correspondence's distances. Drawing on optimizations from one-point
and two-point RANSAC, three-point RANSAC requires only a few iterations.
Eventually, an iterative reweighted least squares (IRLS) is applied to yield
the optimal pose. Experiments on the large-scale KITTI and ETH datasets
demonstrate our method achieves up to three-orders-of-magnitude speedup
compared to MAC while maintaining registration accuracy and recall. Our code is
available at https://github.com/ShiPC-AI/TCF.
comment: 8 pages, 8 figures
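As an illustration of the first stage of such a pipeline, the toy sketch below runs one-point RANSAC with length (rigid-distance) consistency on synthetic correspondences; the threshold and iteration budget are illustrative, not the paper's settings.

```python
# Toy sketch of stage 1 of a two-stage consensus filter: one-point RANSAC
# using length (rigid-distance) consistency. Thresholds are illustrative.
import numpy as np

rng = np.random.default_rng(1)

def one_point_ransac(src, dst, iters=64, tau=0.05):
    """src, dst: (N, 3) putative correspondences. A rigid transform preserves
    distances, so |d(src_i, src_a) - d(dst_i, dst_a)| should be small for
    inliers sharing an anchor correspondence a."""
    best = np.zeros(len(src), dtype=bool)
    for _ in range(iters):
        a = rng.integers(len(src))
        d_src = np.linalg.norm(src - src[a], axis=1)
        d_dst = np.linalg.norm(dst - dst[a], axis=1)
        consensus = np.abs(d_src - d_dst) < tau
        if consensus.sum() > best.sum():
            best = consensus
    return best

# Synthetic example: rotated/translated inliers plus random outliers.
src = rng.normal(size=(200, 3))
R = np.array([[0, -1, 0], [1, 0, 0], [0, 0, 1]], dtype=float)
dst = src @ R.T + np.array([0.5, -0.2, 1.0])
dst[:120] = rng.normal(size=(120, 3))            # 60% outliers
mask = one_point_ransac(src, dst)
print("consensus size:", int(mask.sum()))
```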
☆ TALoS: Enhancing Semantic Scene Completion via Test-time Adaptation on the Line of Sight NeurIPS 2024
Semantic Scene Completion (SSC) aims to perform geometric completion and
semantic segmentation simultaneously. Despite the promising results achieved by
existing studies, the inherently ill-posed nature of the task presents
significant challenges in diverse driving scenarios. This paper introduces
TALoS, a novel test-time adaptation approach for SSC that excavates the
information available in driving environments. Specifically, we exploit the fact that
observations made at a certain moment can serve as Ground Truth (GT) for scene
completion at another moment. Given the characteristics of the LiDAR sensor, an
observation of an object at a certain location confirms both 1) the occupation
of that location and 2) the absence of obstacles along the line of sight from
the LiDAR to that point. TALoS utilizes these observations to obtain
self-supervision about occupancy and emptiness, guiding the model to adapt to
the scene at test time. In a similar manner, we aggregate reliable SSC
predictions among multiple moments and leverage them as semantic pseudo-GT for
adaptation. Further, to leverage future observations that are not accessible at
the current time, we present a dual optimization scheme using the model in
which the update is delayed until the future observation is available.
Evaluations on the SemanticKITTI validation and test sets demonstrate that
TALoS significantly improves the performance of the pre-trained SSC model. Our
code is available at https://github.com/blue-531/TALoS.
comment: Accepted at NeurIPS 2024. Code is available at
https://github.com/blue-531/TALoS
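The line-of-sight supervision can be pictured with the toy voxel-labelling sketch below: the voxel containing a LiDAR return is marked occupied and voxels sampled along the ray are marked empty. The grid setup and ray sampling are assumptions, not the TALoS implementation.

```python
# Toy sketch of line-of-sight self-supervision: the endpoint voxel of a LiDAR
# ray is "occupied", voxels along the ray are "empty". Grid setup is illustrative.
import numpy as np

def los_labels(points, sensor=np.zeros(3), grid_min=np.array([-6.4, -6.4, -1.6]),
               voxel_size=0.2, grid=(64, 64, 16), samples_per_ray=64):
    """points: (N, 3) LiDAR returns in sensor frame; returns occupancy labels
    (1 occupied, 0 empty, -1 unobserved) on a voxel grid anchored at grid_min."""
    labels = -np.ones(grid, dtype=np.int8)

    def to_voxel(p):
        idx = np.floor((p - grid_min) / voxel_size).astype(int)
        return tuple(idx) if np.all((idx >= 0) & (idx < np.array(grid))) else None

    for p in points:
        for t in np.linspace(0.0, 0.95, samples_per_ray):   # free space before the return
            idx = to_voxel(sensor + t * (p - sensor))
            if idx is not None and labels[idx] != 1:
                labels[idx] = 0
        idx = to_voxel(p)
        if idx is not None:
            labels[idx] = 1                                   # occupied at the return
    return labels

pts = np.random.default_rng(0).uniform([-5, -5, -1], [5, 5, 2], size=(500, 3))
lab = los_labels(pts)
print((lab == 1).sum(), "occupied /", (lab == 0).sum(), "empty voxels")
```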
☆ Transforming Blood Cell Detection and Classification with Advanced Deep Learning Models: A Comparative Study
Efficient detection and classification of blood cells are vital for accurate
diagnosis and effective treatment of blood disorders. This study utilizes a
YOLOv10 model trained on Roboflow data with images resized to 640x640 pixels
across varying epochs. The results show that increased training epochs
significantly enhance accuracy, precision, and recall, particularly in
real-time blood cell detection & classification. The YOLOv10 model outperforms
MobileNetV2, ShuffleNetV2, and DarkNet in real-time performance, though
MobileNetV2 and ShuffleNetV2 are more computationally efficient, and DarkNet
excels in feature extraction for blood cell classification. This research
highlights the potential of integrating deep learning models like YOLOv10,
MobileNetV2, ShuffleNetV2, and DarkNet into clinical workflows, promising
improvements in diagnostic accuracy and efficiency. Additionally, a new,
well-annotated blood cell dataset was created and will be open-sourced to
support further advancements in automatic blood cell detection and
classification. The findings demonstrate the transformative impact of these
models in revolutionizing medical diagnostics and enhancing blood disorder
management.
comment: 26 pages, 4884 Words, 17 Figures, 10 Tables
☆ Calibration of ordinal regression networks
Recent studies have shown that deep neural networks are not well-calibrated
and produce over-confident predictions. The miscalibration issue primarily
stems from the minimization of cross-entropy, which aims to align predicted
softmax probabilities with one-hot labels. In ordinal regression tasks, this
problem is compounded by an additional challenge: the expectation that softmax
probabilities should exhibit a unimodal distribution is not met by
cross-entropy. Rather, the ordinal regression literature has focused on
unimodality and overlooked calibration. To address these issues, we propose a
novel loss function that introduces order-aware calibration, ensuring that
prediction confidence adheres to ordinal relationships between classes. It
incorporates soft ordinal encoding and label-smoothing-based regularization to
enforce both calibration and unimodality. Extensive experiments across three
popular ordinal regression benchmarks demonstrate that our approach achieves
state-of-the-art calibration without compromising accuracy.
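One plausible form of order-aware soft supervision is sketched below: targets decay with ordinal distance from the true rank, giving unimodal, label-smoothed supervision. This is an illustrative loss, not necessarily the exact formulation proposed in the paper.

```python
# Illustrative order-aware soft-label loss: targets decay with ordinal distance
# from the true rank, giving unimodal, smoothed supervision (not the exact paper loss).
import torch
import torch.nn.functional as F

def soft_ordinal_targets(labels, num_classes, tau=1.0):
    """labels: (B,) integer ranks -> (B, C) unimodal soft targets."""
    ranks = torch.arange(num_classes, dtype=torch.float32, device=labels.device)
    dist = (ranks.unsqueeze(0) - labels.float().unsqueeze(1)).abs()   # |k - y|
    return F.softmax(-dist / tau, dim=1)

def order_aware_loss(logits, labels, tau=1.0):
    targets = soft_ordinal_targets(labels, logits.size(1), tau)
    return -(targets * F.log_softmax(logits, dim=1)).sum(dim=1).mean()

logits = torch.randn(8, 5, requires_grad=True)
labels = torch.randint(0, 5, (8,))
loss = order_aware_loss(logits, labels)
loss.backward()
print(float(loss))
```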
☆ CL-HOI: Cross-Level Human-Object Interaction Distillation from Vision Large Language Models
Human-object interaction (HOI) detection has seen advancements with Vision
Language Models (VLMs), but these methods often depend on extensive manual
annotations. Vision Large Language Models (VLLMs) can inherently recognize and
reason about interactions at the image level but are computationally heavy and
not designed for instance-level HOI detection. To overcome these limitations,
we propose a Cross-Level HOI distillation (CL-HOI) framework, which distills
instance-level HOIs from VLLMs' image-level understanding without the need for
manual annotations. Our approach involves two stages: context distillation,
where a Visual Linguistic Translator (VLT) converts visual information into
linguistic form, and interaction distillation, where an Interaction Cognition
Network (ICN) reasons about spatial, visual, and context relations. We design
contrastive distillation losses to transfer image-level context and interaction
knowledge from the teacher to the student model, enabling instance-level HOI
detection. Evaluations on HICO-DET and V-COCO datasets demonstrate that our
CL-HOI surpasses existing weakly supervised methods and VLLM supervised
methods, showing its efficacy in detecting HOIs without manual labels.
☆ Resource-Efficient Medical Report Generation using Large Language Models
Medical report generation is the task of automatically writing radiology
reports for chest X-ray images. Manually composing these reports is a
time-consuming process that is also prone to human errors. Generating medical
reports can therefore help reduce the burden on radiologists. In other words,
we can promote greater clinical automation in the medical domain. In this work,
we propose a new framework leveraging vision-enabled Large Language Models
(LLM) for the task of medical report generation. We introduce a lightweight
solution that achieves better or comparable performance compared to
previous solutions on the task of medical report generation. We conduct
extensive experiments exploring different model sizes and enhancement
approaches, such as prefix tuning to improve the text generation abilities of
the LLMs. We evaluate our approach on a prominent large-scale radiology report
dataset - MIMIC-CXR. Our results demonstrate the capability of our
resource-efficient framework to generate patient-specific reports with strong
medical contextual understanding and high precision.
☆ LucidFusion: Generating 3D Gaussians with Arbitrary Unposed Images
Recent large reconstruction models have made notable progress in generating
high-quality 3D objects from single images. However, these methods often
struggle with controllability, as they lack information from multiple views,
leading to incomplete or inconsistent 3D reconstructions. To address this
limitation, we introduce LucidFusion, a flexible end-to-end feed-forward
framework that leverages the Relative Coordinate Map (RCM). Unlike traditional
methods that link images to the 3D world through pose, LucidFusion utilizes RCM to
align geometric features coherently across different views, making it highly
adaptable for 3D generation from arbitrary, unposed images. Furthermore,
LucidFusion seamlessly integrates with the original single-image-to-3D
pipeline, producing detailed 3D Gaussians at a resolution of $512 \times 512$,
making it well-suited for a wide range of applications.
comment: 17 pages, 12 figures, project page: coming soon
☆ Fully Explicit Dynamic Gaussian Splatting NeurIPS 2024
3D Gaussian Splatting has shown fast and high-quality rendering results in
static scenes by leveraging dense 3D prior and explicit representations.
Unfortunately, these benefits do not extend to novel view synthesis for dynamic
motions; ironically, the main barrier is the reliance on the dense prior and
explicit representation, which requires increasing training and rendering times
to account for dynamic motions. In this paper, we design Explicit 4D Gaussian
Splatting (Ex4DGS). Our key idea is to first separate
static and dynamic Gaussians during training, and to explicitly sample
positions and rotations of the dynamic Gaussians at sparse timestamps. The
sampled positions and rotations are then interpolated to represent both
spatially and temporally continuous motions of objects in dynamic scenes as
well as reducing computational cost. Additionally, we introduce a progressive
training scheme and a point-backtracking technique that improves Ex4DGS's
convergence. We initially train Ex4DGS using short timestamps and progressively
extend timestamps, which makes it work well with a few point clouds. The
point-backtracking is used to quantify the cumulative error of each Gaussian
over time, enabling the detection and removal of erroneous Gaussians in dynamic
scenes. Comprehensive experiments on various scenes demonstrate the
state-of-the-art rendering quality from our method, achieving fast rendering of
62 fps on a single 2080Ti GPU.
comment: Accepted at NeurIPS 2024
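The keyframe interpolation idea can be illustrated with the toy sketch below, which interpolates a dynamic Gaussian's position linearly and its rotation with quaternion slerp between sparse timestamps; Ex4DGS's actual interpolants and parameterization may differ.

```python
# Toy sketch: interpolate a dynamic Gaussian's pose between sparse keyframes
# (linear for position, slerp for rotation); the paper's interpolants may differ.
import numpy as np

def slerp(q0, q1, t):
    """Spherical interpolation between unit quaternions q0, q1 at fraction t."""
    q0, q1 = q0 / np.linalg.norm(q0), q1 / np.linalg.norm(q1)
    dot = np.clip(np.dot(q0, q1), -1.0, 1.0)
    if dot < 0.0:                      # take the short path
        q1, dot = -q1, -dot
    if dot > 0.9995:                   # nearly parallel: fall back to lerp
        q = q0 + t * (q1 - q0)
        return q / np.linalg.norm(q)
    theta = np.arccos(dot)
    return (np.sin((1 - t) * theta) * q0 + np.sin(t * theta) * q1) / np.sin(theta)

def interpolate_pose(times, positions, rotations, t):
    """Keyframes: times (K,), positions (K, 3), rotations (K, 4) as quaternions."""
    k = np.searchsorted(times, t, side="right") - 1
    k = np.clip(k, 0, len(times) - 2)
    u = (t - times[k]) / (times[k + 1] - times[k])
    pos = (1 - u) * positions[k] + u * positions[k + 1]
    rot = slerp(rotations[k], rotations[k + 1], u)
    return pos, rot

times = np.array([0.0, 0.5, 1.0])
positions = np.array([[0, 0, 0], [1, 0, 0], [1, 1, 0]], dtype=float)
rotations = np.array([[1, 0, 0, 0], [0.707, 0, 0.707, 0], [0, 0, 1, 0]], dtype=float)
print(interpolate_pose(times, positions, rotations, 0.25))
```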
☆ Towards Kriging-informed Conditional Diffusion for Regional Sea-Level Data Downscaling
Given coarser-resolution projections from global climate models or satellite
data, the downscaling problem aims to estimate finer-resolution regional
climate data, capturing fine-scale spatial patterns and variability.
Downscaling is any method to derive high-resolution data from low-resolution
variables, often to provide more detailed and local predictions and analyses.
This problem is societally crucial for effective adaptation, mitigation, and
resilience against significant risks from climate change. The challenge arises
from spatial heterogeneity and the need to recover finer-scale features while
ensuring model generalization. Most downscaling methods \cite{Li2020} fail to
capture the spatial dependencies at finer scales and underperform on real-world
climate datasets, such as sea-level rise. We propose a novel Kriging-informed
Conditional Diffusion Probabilistic Model (Ki-CDPM) to capture spatial
variability while preserving fine-scale features. Experimental results on
climate data show that our proposed method is more accurate than
state-of-the-art downscaling techniques.
☆ Erasing Undesirable Concepts in Diffusion Models with Adversarial Preservation
Diffusion models excel at generating visually striking content from text but
can inadvertently produce undesirable or harmful content when trained on
unfiltered internet data. A practical solution is to selectively remove
target concepts from the model, but this may impact the remaining concepts.
Prior approaches have tried to balance this by introducing a loss term to
preserve neutral content or a regularization term to minimize changes in the
model parameters, yet resolving this trade-off remains challenging. In this
work, we propose to identify and preserve the concepts most affected by parameter
changes, termed \textit{adversarial concepts}. This approach ensures stable
erasure with minimal impact on the other concepts. We demonstrate the
effectiveness of our method using the Stable Diffusion model, showing that it
outperforms state-of-the-art erasure methods in eliminating unwanted content
while maintaining the integrity of other unrelated elements. Our code is
available at
\url{https://github.com/tuananhbui89/Erasing-Adversarial-Preservation}.
☆ Joint Top-Down and Bottom-Up Frameworks for 3D Visual Grounding ICPR2024
This paper tackles the challenging task of 3D visual grounding: locating a
specific object in a 3D point cloud scene based on text descriptions. Existing
methods fall into two categories: top-down and bottom-up methods. Top-down
methods rely on a pre-trained 3D detector to generate and select the best
bounding box, resulting in time-consuming processes. Bottom-up methods directly
regress object bounding boxes with coarse-grained features, producing worse
results. To combine their strengths while addressing their limitations, we
propose a joint top-down and bottom-up framework, aiming to enhance the
performance while improving the efficiency. Specifically, in the first stage,
we propose a bottom-up based proposal generation module, which utilizes
lightweight neural layers to efficiently regress and cluster several coarse
object proposals instead of using a complex 3D detector. Then, in the second
stage, we introduce a top-down based proposal consolidation module, which
utilizes graph design to effectively aggregate and propagate the query-related
object contexts among the generated proposals for further refinement. By
jointly training these two modules, we can avoid the inherent drawbacks of the
complex proposals in the top-down framework and the coarse proposals in the
bottom-up framework. Experimental results on the ScanRefer benchmark show that
our framework is able to achieve the state-of-the-art performance.
comment: Accepted by ICPR2024
☆ Topology-Aware Exploration of Circle of Willis for CTA and MRA: Segmentation, Detection, and Classification MICCAI 2024
The Circle of Willis (CoW) is a critical vessel structure connecting the major
circulations of the brain. The topology of this vascular structure is of
clinical significance for evaluating the risk and severity of neuro-vascular diseases. The
CoW has two representative angiographic imaging modalities, computed tomography
angiography (CTA) and magnetic resonance angiography (MRA). TopCoW24 provided a
dataset of 125 paired CTA-MRA cases for the analysis of the CoW. To explore both
CTA and MRA images in a unified framework and learn the inherent topology of the
CoW, we construct a universal dataset via independent intensity preprocessing,
followed by joint resampling and normalization. Then, we utilize the topology-aware
loss to enhance the topology completeness of the CoW and the discrimination
between different classes. A complementary topology-aware refinement is further
conducted to enhance the connectivity within the same class. Our method was
evaluated on all the three tasks and two modalities, achieving competitive
results. In the final test phase of the TopCoW24 Challenge, we achieved second
place in the CTA-Seg-Task, third place in the CTA-Box-Task, first place in the
CTA-Edg-Task, second place in the MRA-Seg-Task, third place in the MRA-Box-Task,
and second place in the MRA-Edg-Task.
comment: Participation technical report for TopCoW24 challenge @ MICCAI 2024
☆ Exploring Stronger Transformer Representation Learning for Occluded Person Re-Identification
Due to some complex factors (e.g., occlusion, pose variation and diverse
camera perspectives), extracting stronger feature representation in person
re-identification remains a challenging task. In this paper, we propose
SSSC-TransReID, a novel transformer-based person re-identification framework
that combines self-supervised and supervised learning. Different from the general
transformer-based person re-identification models, we designed a
self-supervised contrastive learning branch, which can enhance the feature
representation for person re-identification without negative samples or
additional pre-training. In order to train the contrastive learning branch, we
also proposed a novel random rectangle mask strategy to simulate the occlusion
in real scenes, so as to enhance the feature representation for occlusion.
Finally, we utilized the joint-training loss function to integrate the
advantages of supervised learning with ID tags and self-supervised contrastive
learning without negative samples, which can reinforce the ability of our model
to excavate stronger discriminative features, especially for occlusion.
Extensive experimental results on several benchmark datasets show our proposed
model obtains superior Re-ID performance consistently and outperforms the
state-of-the-art ReID methods by large margins in mean average precision
(mAP) and Rank-1 accuracy.
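A minimal sketch of a random rectangle occlusion mask, in the spirit of the strategy described above, is given below; the scale range and fill value are assumptions rather than the paper's settings.

```python
# Toy sketch of a random rectangle occlusion mask for a batch of person images
# (scale range and fill value are assumptions, not the paper's exact settings).
import torch

def random_rectangle_mask(images, scale=(0.1, 0.3), fill=0.0):
    """images: (B, C, H, W). Zeroes out one random rectangle per image."""
    b, _, h, w = images.shape
    out = images.clone()
    for i in range(b):
        area = torch.empty(1).uniform_(*scale).item() * h * w
        rh = int(min(h, max(1, round(area ** 0.5))))
        rw = int(min(w, max(1, round(area / rh))))
        top = torch.randint(0, h - rh + 1, (1,)).item()
        left = torch.randint(0, w - rw + 1, (1,)).item()
        out[i, :, top:top + rh, left:left + rw] = fill
    return out

imgs = torch.rand(4, 3, 256, 128)           # typical ReID input resolution
occluded = random_rectangle_mask(imgs)
print(float((occluded == 0).float().mean()))  # fraction of masked pixels
```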
☆ Deep Active Learning with Manifold-preserving Trajectory Sampling
Active learning (AL) optimizes the selection of unlabeled data for
annotation (labeling), aiming to enhance model performance while minimizing
labeling effort. The key question in AL is which unlabeled data should be
selected for annotation. Existing deep AL methods arguably suffer from bias
incurred by labeled data, which accounts for a much smaller percentage than
unlabeled data in the AL context. We observe that such an issue is severe in different types
of data, such as vision and non-vision data. To address this issue, we propose
a novel method, namely Manifold-Preserving Trajectory Sampling (MPTS), aiming
to enforce the feature space learned from labeled data to represent a more
accurate manifold. By doing so, we expect to effectively correct the bias
incurred by labeled data, which can cause a biased selection of unlabeled data.
Despite its focus on manifold, the proposed method can be conveniently
implemented by performing distribution mapping with MMD (Maximum Mean
Discrepancies). Extensive experiments on various vision and non-vision
benchmark datasets demonstrate the superiority of our method. Our source code
can be found here.
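The MMD-based distribution mapping mentioned above can be sketched as follows; only the RBF-kernel MMD penalty is shown, and the manifold-preserving trajectory sampling itself is omitted.

```python
# Minimal RBF-kernel MMD between labeled and unlabeled feature batches,
# usable as a distribution-matching penalty (the trajectory sampling of MPTS is omitted).
import torch

def rbf_mmd2(x, y, sigma=1.0):
    """Biased estimate of squared MMD between samples x: (n, d) and y: (m, d)."""
    def kernel(a, b):
        d2 = torch.cdist(a, b) ** 2
        return torch.exp(-d2 / (2 * sigma ** 2))
    return kernel(x, x).mean() + kernel(y, y).mean() - 2 * kernel(x, y).mean()

labeled_feats = torch.randn(64, 128)
unlabeled_feats = torch.randn(256, 128) + 0.5     # shifted distribution
print(float(rbf_mmd2(labeled_feats, unlabeled_feats)))
```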
☆ P-YOLOv8: Efficient and Accurate Real-Time Detection of Distracted Driving
Distracted driving is a critical safety issue that leads to numerous
fatalities and injuries worldwide. This study addresses the urgent need for
efficient and real-time machine learning models to detect distracted driving
behaviors. Leveraging the Pretrained YOLOv8 (P-YOLOv8) model, a real-time
object detection system is introduced, optimized for both speed and accuracy.
This approach addresses the computational constraints and latency limitations
commonly associated with conventional detection models. The study demonstrates
P-YOLOv8's versatility in both object detection and image classification tasks
using the Distracted Driver Detection dataset from State Farm, which includes
22,424 images across ten behavior categories. Our research explores the
application of P-YOLOv8 for image classification, evaluating its performance
compared to deep learning models such as VGG16, VGG19, and ResNet. Some
traditional models often struggle with low accuracy, while others achieve high
accuracy but come with high computational costs and slow detection speeds,
making them unsuitable for real-time applications. P-YOLOv8 addresses these
issues by achieving competitive accuracy with significant advantages in
computational cost and efficiency. In particular, P-YOLOv8 generates a lightweight
model with a size of only 2.84 MB and a lower number of parameters, totaling
1,451,098, due to its innovative architecture. It achieves a high accuracy of
99.46 percent with this small model size, opening new directions for deployment
on inexpensive and small embedded devices using Tiny Machine Learning (TinyML).
The experimental results show robust performance, making P-YOLOv8 a
cost-effective solution for real-time deployment. This study provides a
detailed analysis of P-YOLOv8's architecture, training, and performance
benchmarks, highlighting its potential for real-time use in detecting
distracted driving.
★ Deep Learning and Machine Learning -- Object Detection and Semantic Segmentation: From Theory to Applications
Jintao Ren, Ziqian Bi, Qian Niu, Junyu Liu, Benji Peng, Sen Zhang, Xuanhe Pan, Jinlang Wang, Keyu Chen, Caitlyn Heqi Yin, Pohsun Feng, Yizhu Wen, Tianyang Wang, Silin Chen, Ming Li, Jiawei Xu, Ming Liu
This book offers an in-depth exploration of object detection and semantic
segmentation, combining theoretical foundations with practical applications. It
covers state-of-the-art advancements in machine learning and deep learning,
with a focus on convolutional neural networks (CNNs), YOLO architectures, and
transformer-based approaches like DETR. The book also delves into the
integration of artificial intelligence (AI) techniques and large language
models for enhanced object detection in complex environments. A thorough
discussion of big data analysis is presented, highlighting the importance of
data processing, model optimization, and performance evaluation metrics. By
bridging the gap between traditional methods and modern deep learning
frameworks, this book serves as a comprehensive guide for researchers, data
scientists, and engineers aiming to leverage AI-driven methodologies in
large-scale object detection tasks.
comment: 167 pages
☆ ARTS: Semi-Analytical Regressor using Disentangled Skeletal Representations for Human Mesh Recovery from Videos ACM MM 2024
Although existing video-based 3D human mesh recovery methods have made
significant progress, simultaneously estimating human pose and shape from
low-resolution image features limits their performance. These image features
lack sufficient spatial information about the human body and contain various
noises (e.g., background, lighting, and clothing), which often results in
inaccurate pose and inconsistent motion. Inspired by the rapid advance in human
pose estimation, we discover that compared to image features, skeletons
inherently contain accurate human pose and motion. Therefore, we propose a
novel semiAnalytical Regressor using disenTangled Skeletal representations for
human mesh recovery from videos, called ARTS. Specifically, a skeleton
estimation and disentanglement module is proposed to estimate the 3D skeletons
from a video and decouple them into disentangled skeletal representations
(i.e., joint position, bone length, and human motion). Then, to fully utilize
these representations, we introduce a semi-analytical regressor to estimate the
parameters of the human mesh model. The regressor consists of three modules:
Temporal Inverse Kinematics (TIK), Bone-guided Shape Fitting (BSF), and
Motion-Centric Refinement (MCR). TIK utilizes joint position to estimate
initial pose parameters and BSF leverages bone length to regress bone-aligned
shape parameters. Finally, MCR combines human motion representation with image
features to refine the initial human model parameters. Extensive experiments
demonstrate that our ARTS surpasses existing state-of-the-art video-based
methods in both per-frame accuracy and temporal consistency on popular
benchmarks: 3DPW, MPI-INF-3DHP, and Human3.6M. Code is available at
https://github.com/TangTao-PKU/ARTS.
comment: Accepted by ACM MM 2024. Project page:
https://github.com/TangTao-PKU/ARTS
☆ Multimodal Learning for Embryo Viability Prediction in Clinical IVF MICCAI 2024
Junsik Kim, Zhiyi Shi, Davin Jeong, Johannes Knittel, Helen Y. Yang, Yonghyun Song, Wanhua Li, Yicong Li, Dalit Ben-Yosef, Daniel Needleman, Hanspeter Pfister
In clinical In-Vitro Fertilization (IVF), identifying the most viable embryo
for transfer is important for increasing the likelihood of a successful
pregnancy. Traditionally, this process involves embryologists manually
assessing embryos' static morphological features at specific intervals using
light microscopy. This manual evaluation is not only time-intensive and costly,
due to the need for expert analysis, but also inherently subjective, leading to
variability in the selection process. To address these challenges, we develop a
multimodal model that leverages both time-lapse video data and Electronic
Health Records (EHRs) to predict embryo viability. One of the primary
challenges of our research is to effectively combine time-lapse video and EHR
data, owing to their inherent differences in modality. We comprehensively
analyze our multimodal model with various modality inputs and integration
approaches. Our approach will enable fast and automated embryo viability
predictions at scale for clinical IVF.
comment: Accepted to MICCAI 2024
☆ Online Pseudo-Label Unified Object Detection for Multiple Datasets Training
The Unified Object Detection (UOD) task aims to achieve object detection of
all merged categories through training on multiple datasets, and is of great
significance in comprehensive object detection scenarios. In this paper, we
conduct a thorough analysis of the cross datasets missing annotations issue,
and propose an Online Pseudo-Label Unified Object Detection scheme. Our method
uses a periodically updated teacher model to generate pseudo-labels for the
unlabelled objects in each sub-dataset. This periodic update strategy better
ensures that the accuracy of the teacher model reaches a local maximum,
maximizing the quality of the pseudo-labels. In addition, we survey the
influence of overlapped region proposals on the accuracy of box regression. We
propose a category-specific box regression and a pseudo-label RPN head to
improve the recall rate of the Region Proposal Network (RPN). Our experimental
results on commonly used benchmarks (e.g., COCO, Object365 and OpenImages)
indicate that our online pseudo-label UOD method achieves higher accuracy than
existing SOTA methods.
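A schematic version of the periodic-teacher pseudo-labelling loop is sketched below; the detector output format, refresh schedule, and confidence threshold are placeholders, not the paper's implementation.

```python
# Schematic pseudo-label loop with a periodically refreshed teacher
# (the detector output format and thresholds are placeholders, not a specific library).
import copy
import torch

def update_teacher(student, teacher):
    """Hard refresh: copy student weights into the frozen teacher."""
    teacher.load_state_dict(copy.deepcopy(student.state_dict()))
    for p in teacher.parameters():
        p.requires_grad_(False)

def pseudo_labels(teacher, images, labelled_categories, score_thresh=0.7):
    """Keep confident teacher detections for categories the sub-dataset never labels."""
    with torch.no_grad():
        dets = teacher(images)                       # assumed: list of dicts per image
    keep = []
    for det in dets:
        mask = (det["scores"] > score_thresh) & ~torch.isin(det["labels"], labelled_categories)
        keep.append({k: v[mask] for k, v in det.items()})
    return keep

# Training skeleton (illustrative only):
# for step, (images, targets) in enumerate(loader):
#     if step % refresh_every == 0:
#         update_teacher(student, teacher)
#     targets = merge(targets, pseudo_labels(teacher, images, labelled_cats))
#     loss = student(images, targets); loss.backward(); optimizer.step()
```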
☆ A Dual Process VLA: Efficient Robotic Manipulation Leveraging VLM
Vision-Language-Action (VLA) models are receiving increasing attention for
their ability to enable robots to perform complex tasks by integrating visual
context with linguistic commands. However, achieving efficient real-time
performance remains challenging due to the high computational demands of
existing models. To overcome this, we propose Dual Process VLA (DP-VLA), a
hierarchical framework inspired by dual-process theory. DP-VLA utilizes a Large
System 2 Model (L-Sys2) for complex reasoning and decision-making, while a
Small System 1 Model (S-Sys1) handles real-time motor control and sensory
processing. By leveraging Vision-Language Models (VLMs), the L-Sys2 operates at
low frequencies, reducing computational overhead, while the S-Sys1 ensures fast
and accurate task execution. Experimental results on the RoboCasa dataset
demonstrate that DP-VLA achieves faster inference and higher task success
rates, providing a scalable solution for advanced robotic applications.
comment: 10 pages
♻ ☆ Toward Generalizing Visual Brain Decoding to Unseen Subjects
Visual brain decoding aims to decode visual information from human brain
activities. Despite the great progress, one critical limitation of current
brain decoding research lies in the lack of generalization capability to unseen
subjects. Prior works typically focus on decoding brain activity of individuals
based on the observation that different subjects exhibit different brain
activities, while it remains unclear whether brain decoding can be generalized
to unseen subjects. This study aims to answer this question. We first
consolidate an image-fMRI dataset consisting of stimulus-image and
fMRI-response pairs, involving 177 subjects in the movie-viewing task of the
Human Connectome Project (HCP). This dataset allows us to investigate the brain
decoding performance with the increase of participants. We then present a
learning paradigm that applies uniform processing across all subjects, instead
of employing different network heads or tokenizers for individuals as in
previous methods, which can accommodate a large number of subjects to explore
the generalization capability across different subjects. A series of
experiments are conducted and we have the following findings. First, the
network exhibits clear generalization capabilities with the increase of
training subjects. Second, the generalization capability is common to popular
network architectures (MLP, CNN and Transformer). Third, the generalization
performance is affected by the similarity between subjects. Our findings reveal
the inherent similarities in brain activities across individuals. With the
emergence of larger and more comprehensive datasets, it is possible to train a
brain decoding foundation model in the future. Codes and models can be found at
https://github.com/Xiangtaokong/TGBD.
♻ ☆ Utilizing Large Language Models in An Iterative Paradigm with Domain Feedback for Molecule Optimization
Molecule optimization is a critical task in drug discovery to optimize
desired properties of a given molecule through chemical modification. Despite
Large Language Models (LLMs) holding the potential to efficiently simulate this
task by using natural language to direct the optimization, straightforwardly
utilizing them shows limited performance. In this work, we facilitate utilizing LLMs
in an iterative paradigm by proposing a simple yet highly effective domain
feedback provider, namely $\text{Re}^2$DF. In detail, $\text{Re}^2$DF harnesses
an external toolkit, RDKit, to detect molecule hallucination, i.e., when the
modified molecule is chemically invalid. Otherwise, its desired properties are
computed and compared to the original one, establishing reliable domain
feedback with correct direction and distance towards the objective, followed by
a retrieved example, to explicitly guide the LLM to refine the modified
molecule. We conduct experiments across both single- and multi-property
objectives with 2 thresholds, where $\text{Re}^2$DF shows significant
improvements. Particularly, for 20 single-property objectives, $\text{Re}^2$DF
enhances Hit ratio by 16.95% and 20.76% under loose and strict thresholds,
respectively. For 32 multi-property objectives, $\text{Re}^2$DF enhances Hit
ratio by 6.04% and 5.25%.
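The RDKit-based validity check and property feedback can be sketched as follows; the chosen property (logP) and the feedback wording are illustrative, not the exact $\text{Re}^2$DF protocol.

```python
# Sketch of RDKit-based feedback: reject chemically invalid edits, otherwise
# compare a property of the modified molecule to the original (logP is illustrative).
from rdkit import Chem
from rdkit.Chem import Descriptors

def domain_feedback(original_smiles: str, modified_smiles: str, target_delta: float = 0.5):
    mod = Chem.MolFromSmiles(modified_smiles)
    if mod is None:
        return "invalid", None        # hallucinated / chemically invalid molecule
    orig = Chem.MolFromSmiles(original_smiles)
    delta = Descriptors.MolLogP(mod) - Descriptors.MolLogP(orig)
    direction = "increase further" if delta < target_delta else "objective met"
    return direction, delta

print(domain_feedback("CCO", "CCCO"))            # valid edit: reports property change
print(domain_feedback("CCO", "C(C)(C)(C)(C)C"))  # pentavalent carbon: invalid
```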
♻ ☆ Decomposing and Interpreting Image Representations via Text in ViTs Beyond CLIP NeurIPS 2024
Recent work has explored how individual components of the CLIP-ViT model
contribute to the final representation by leveraging the shared image-text
representation space of CLIP. These components, such as attention heads and
MLPs, have been shown to capture distinct image features like shape, color or
texture. However, understanding the role of these components in arbitrary
vision transformers (ViTs) is challenging. To this end, we introduce a general
framework which can identify the roles of various components in ViTs beyond
CLIP. Specifically, we (a) automate the decomposition of the final
representation into contributions from different model components, and (b)
linearly map these contributions to CLIP space to interpret them via text.
Additionally, we introduce a novel scoring function to rank components by their
importance with respect to specific features. Applying our framework to various
ViT variants (e.g. DeiT, DINO, DINOv2, Swin, MaxViT), we gain insights into the
roles of different components concerning particular image features. These
insights facilitate applications such as image retrieval using text
descriptions or reference images, visualizing token importance heatmaps, and
mitigating spurious correlations. We release our code to reproduce the
experiments at https://github.com/SriramB-98/vit-decompose
comment: NeurIPS 2024, 31 pages, 15 figures
♻ ☆ RACCooN: A Versatile Instructional Video Editing Framework with Auto-Generated Narratives
Recent video generative models primarily rely on carefully written text
prompts for specific tasks, like inpainting or style editing. They require
labor-intensive textual descriptions for input videos, hindering their
flexibility to adapt personal/raw videos to user specifications. This paper
proposes RACCooN, a versatile and user-friendly video-to-paragraph-to-video
generative framework that supports multiple video editing capabilities such as
removal, addition, and modification, through a unified pipeline. RACCooN
consists of two principal stages: Video-to-Paragraph (V2P) and
Paragraph-to-Video (P2V). In the V2P stage, we automatically describe video
scenes in well-structured natural language, capturing both the holistic context
and focused object details. Subsequently, in the P2V stage, users can
optionally refine these descriptions to guide the video diffusion model,
enabling various modifications to the input video, such as removing, changing
subjects, and/or adding new objects. The proposed approach stands out from
other methods through several significant contributions: (1) RACCooN suggests a
multi-granular spatiotemporal pooling strategy to generate well-structured
video descriptions, capturing both the broad context and object details without
requiring complex human annotations, simplifying precise video content editing
based on text for users. (2) Our video generative model incorporates
auto-generated narratives or instructions to enhance the quality and accuracy
of the generated content. (3) RACCooN also plans to imagine new objects in a
given video, so users simply prompt the model to receive a detailed video
editing plan for complex video editing. The proposed framework demonstrates
impressive versatile capabilities in video-to-paragraph generation, video
content editing, and can be incorporated into other SoTA video generative
models for further enhancement.
comment: The first two authors contribute equally. Project Page:
https://raccoon-mllm-gen.github.io/
♻ ☆ Human-Agent Joint Learning for Efficient Robot Manipulation Skill Acquisition
Shengcheng Luo, Quanquan Peng, Jun Lv, Kaiwen Hong, Katherine Rose Driggs-Campbell, Cewu Lu, Yong-Lu Li
Employing a teleoperation system for gathering demonstrations offers the
potential for more efficient learning of robot manipulation. However,
teleoperating a robot arm equipped with a dexterous hand or gripper, via a
teleoperation system presents inherent challenges due to the task's high
dimensionality, complexity of motion, and differences between physiological
structures. In this study, we introduce a novel system for joint learning
between human operators and robots, that enables human operators to share
control of a robot end-effector with a learned assistive agent, simplifies the
data collection process, and facilitates simultaneous human demonstration
collection and robot manipulation training. As data accumulates, the assistive
agent gradually learns. Consequently, less human effort and attention are
required, enhancing the efficiency of the data collection process. It also
allows the human operator to adjust the control ratio to achieve a trade-off
between manual and automated control. We conducted experiments in both
simulated environments and physical real-world settings. Through user studies
and quantitative evaluations, it is evident that the proposed system could
enhance data collection efficiency and reduce the need for human adaptation
while ensuring the collected data is of sufficient quality for downstream
tasks. For more details, please refer to our webpage
https://norweig1an.github.io/HAJL.github.io/.
comment: 8 pages, 6 figures
♻ ☆ CoTCoNet: An Optimized Coupled Transformer-Convolutional Network with an Adaptive Graph Reconstruction for Leukemia Detection
Swift and accurate blood smear analysis is an effective diagnostic method for
leukemia and other hematological malignancies. However, manual leukocyte count
and morphological evaluation using a microscope is time-consuming and prone to
errors. Conventional image processing methods also exhibit limitations in
differentiating cells due to the visual similarity between malignant and benign
cell morphology. This limitation is further compounded by the skewed training
data that hinders the extraction of reliable and pertinent features. In
response to these challenges, we propose an optimized Coupled Transformer
Convolutional Network (CoTCoNet) framework for the classification of leukemia,
which employs a well-designed transformer integrated with a deep convolutional
network to effectively capture comprehensive global features and scalable
spatial patterns, enabling the identification of complex and large-scale
hematological features. Further, the framework incorporates a graph-based
feature reconstruction module to reveal the hidden or unobserved hard-to-see
biological features of leukocyte cells and employs a Population-based
Meta-Heuristic Algorithm for feature selection and optimization. To mitigate
data imbalance issues, we employ a synthetic leukocyte generator. In the
evaluation phase, we initially assess CoTCoNet on a dataset containing 16,982
annotated cells, and it achieves remarkable accuracy and F1-Score rates of
0.9894 and 0.9893, respectively. To broaden the generalizability of our model,
we evaluate it across four publicly available diverse datasets, which include
the aforementioned dataset. This evaluation demonstrates that our method
outperforms current state-of-the-art approaches. We also incorporate an
explainability approach in the form of feature visualization closely aligned
with cell annotations to provide a deeper understanding of the framework.
♻ ☆ PUMA: Empowering Unified MLLM with Multi-granular Visual Generation
Rongyao Fang, Chengqi Duan, Kun Wang, Hao Li, Hao Tian, Xingyu Zeng, Rui Zhao, Jifeng Dai, Hongsheng Li, Xihui Liu
Recent advancements in multimodal foundation models have yielded significant
progress in vision-language understanding. Initial attempts have also explored
the potential of multimodal large language models (MLLMs) for visual content
generation. However, existing works have insufficiently addressed the varying
granularity demands of different image generation tasks within a unified MLLM
paradigm - from the diversity required in text-to-image generation to the
precise controllability needed in image manipulation. In this work, we propose
PUMA, emPowering Unified MLLM with Multi-grAnular visual generation. PUMA
unifies multi-granular visual features as both inputs and outputs of MLLMs,
elegantly addressing the different granularity requirements of various image
generation tasks within a unified MLLM framework. Following multimodal
pretraining and task-specific instruction tuning, PUMA demonstrates proficiency
in a wide range of multimodal tasks. This work represents a significant step
towards a truly unified MLLM capable of adapting to the granularity demands of
various visual tasks. The code and model will be released in
https://github.com/rongyaofang/PUMA.
comment: Project page: https://rongyaofang.github.io/puma/
♻ ☆ Pre-processing and Compression: Understanding Hidden Representation Refinement Across Imaging Domains via Intrinsic Dimension NeurIPS 2024
In recent years, there has been interest in how geometric properties such as
intrinsic dimension (ID) of a neural network's hidden representations change
through its layers, and how such properties are predictive of important model
behavior such as generalization ability. However, evidence has begun to emerge
that such behavior can change significantly depending on the domain of the
network's training data, such as natural versus medical images. Here, we
further this inquiry by exploring how the ID of a network's learned
representations changes through its layers, in essence, characterizing how the
network successively refines the information content of input data to be used
for predictions. Analyzing eleven natural and medical image datasets across six
network architectures, we find that how ID changes through the network differs
noticeably between natural and medical image models. Specifically, medical
image models peak in representation ID earlier in the network, implying a
difference in the image features and their abstractness that are typically used
for downstream tasks in these domains. Additionally, we discover a strong
correlation of this peak representation ID with the ID of the data in its input
space, implying that the intrinsic information content of a model's learned
representations is guided by that of the data it was trained on. Overall, our
findings emphasize notable discrepancies in network behavior between natural
and non-natural imaging domains regarding hidden representation information
content, and provide further insights into how a network's learned features are
shaped by its training data.
comment: Published in NeurIPS 2024 Workshop on Scientific Methods for
Understanding Deep Learning (SciForDL)
♻ ☆ SETA: Semantic-Aware Token Augmentation for Domain Generalization
Domain generalization (DG) aims to enhance the model robustness against
domain shifts without accessing target domains. A prevalent category of methods
for DG is data augmentation, which focuses on generating virtual samples to
simulate domain shifts. However, existing augmentation techniques in DG are
mainly tailored for convolutional neural networks (CNNs), with limited
exploration in token-based architectures, i.e., vision transformer (ViT) and
multi-layer perceptrons (MLP) models. In this paper, we study the impact of
prior CNN-based augmentation methods on token-based models, revealing their
performance is suboptimal due to the lack of incentivizing the model to learn
holistic shape information. To tackle the issue, we propose the SEmantic-aware
Token Augmentation (SETA) method. SETA transforms token features by perturbing
local edge cues while preserving global shape features, thereby enhancing the
model learning of shape information. To further enhance the generalization
ability of the model, we introduce two stylized variants of our method combined
with two state-of-the-art style augmentation methods in DG. We provide a
theoretical insight into our method, demonstrating its effectiveness in
reducing the generalization risk bound. Comprehensive experiments on five
benchmarks prove that our method achieves SOTA performances across various ViT
and MLP architectures. Our code is available at
https://github.com/lingeringlight/SETA.
comment: Accepted by IEEE TIP 2024. The code is available at
https://github.com/lingeringlight/SETA
♻ ☆ Machine Unlearning in Forgettability Sequence
Machine unlearning (MU) is becoming a promising paradigm to achieve the
"right to be forgotten", where the training trace of any chosen data points
could be eliminated, while maintaining the model utility on general testing
samples after unlearning. With the advancement of forgetting research, many
fundamental open questions remain unanswered: do different samples exhibit
varying levels of difficulty in being forgotten? Further, does the sequence in
which samples are forgotten, determined by their respective difficulty levels,
influence the performance of forgetting algorithms? In this paper, we identify
key factors affecting unlearning difficulty and the performance of unlearning
algorithms. We find that samples with higher privacy risks are more likely to
be unlearned, indicating that unlearning difficulty varies among different
samples, which motivates a more precise unlearning mode. Built upon this insight,
we propose a general unlearning framework, dubbed RSU, which consists of a
Ranking module and a SeqUnlearn module.
comment: The senior authors of the draft are not fully convinced that the
novelty is significant enough for this submission compared to the latest
research progress in this area. Additionally, the senior authors have
identified writing issues. Based on these two reasons, we have decided to
withdraw the draft from arXiv
♻ ☆ From FDG to PSMA: A Hitchhiker's Guide to Multitracer, Multicenter Lesion Segmentation in PET/CT Imaging
Maximilian Rokuss, Balint Kovacs, Yannick Kirchhoff, Shuhan Xiao, Constantin Ulrich, Klaus H. Maier-Hein, Fabian Isensee
Automated lesion segmentation in PET/CT scans is crucial for improving
clinical workflows and advancing cancer diagnostics. However, the task is
challenging due to physiological variability, different tracers used in PET
imaging, and diverse imaging protocols across medical centers. To address this,
the autoPET series was created to challenge researchers to develop algorithms
that generalize across diverse PET/CT environments. This paper presents our
solution for the autoPET III challenge, targeting multitracer, multicenter
generalization using the nnU-Net framework with the ResEncL architecture. Key
techniques include misalignment data augmentation and multi-modal pretraining
across CT, MR, and PET datasets to provide an initial anatomical understanding.
We incorporate organ supervision as a multitask approach, enabling the model to
distinguish between physiological uptake and tracer-specific patterns, which is
particularly beneficial in cases where no lesions are present. Compared to the
default nnU-Net, which achieved a Dice score of 57.61, or the larger ResEncL
(65.31) our model significantly improved performance with a Dice score of
68.40, alongside a reduction in false positive (FPvol: 7.82) and false negative
(FNvol: 10.35) volumes. These results underscore the effectiveness of combining
advanced network design, augmentation, pretraining, and multitask learning for
PET/CT lesion segmentation. After evaluation on the test set, our approach was
awarded the first place in the model-centric category (Team LesionTracer). Code
is publicly available at https://github.com/MIC-DKFZ/autopet-3-submission.
comment: Winning method of the autoPET III challenge (model-centric) - Team
LesionTracer
♻ ☆ Deep Correlated Prompting for Visual Recognition with Missing Modalities NeurIPS 2024
Large-scale multimodal models have shown excellent performance over a series
of tasks powered by the large corpus of paired multimodal training data.
Generally, they are always assumed to receive modality-complete inputs.
However, this simple assumption may not always hold in the real world due to
privacy constraints or collection difficulty, where models pretrained on
modality-complete data easily demonstrate degraded performance on
missing-modality cases. To handle this issue, we refer to prompt learning to
adapt large pretrained multimodal models to handle missing-modality scenarios
by regarding different missing cases as different types of input. Instead of
only prepending independent prompts to the intermediate layers, we present to
leverage the correlations between prompts and input features and excavate the
relationships between different layers of prompts to carefully design the
instructions. We also incorporate the complementary semantics of different
modalities to guide the prompting design for each modality. Extensive
experiments on three commonly-used datasets consistently demonstrate the
superiority of our method compared to the previous approaches upon different
missing scenarios. Plentiful ablations are further given to show the
generalizability and reliability of our method upon different modality-missing
ratios and types.
comment: NeurIPS 2024, add some results
♻ ☆ UNetMamba: An Efficient UNet-Like Mamba for Semantic Segmentation of High-Resolution Remote Sensing Images
Semantic segmentation of high-resolution remote sensing images is vital in
downstream applications such as land-cover mapping, urban planning and disaster
assessment. Existing Transformer-based methods suffer from the constraint
between accuracy and efficiency, while the recently proposed Mamba is renowned
for being efficient. Therefore, to overcome the dilemma, we propose UNetMamba,
a UNet-like semantic segmentation model based on Mamba. It incorporates a mamba
segmentation decoder (MSD) that can efficiently decode the complex information
within high-resolution images, and a local supervision module (LSM), which is
train-only but can significantly enhance the perception of local contents.
Extensive experiments demonstrate that UNetMamba outperforms the
state-of-the-art methods with mIoU increased by 0.87% on LoveDA and 0.39% on
ISPRS Vaihingen, while achieving high efficiency through the lightweight
design, less memory footprint and reduced computational cost. The source code
is available at https://github.com/EnzeZhu2001/UNetMamba.
comment: 5 pages, 3 figures
♻ ☆ A gradient-based approach to fast and accurate head motion compensation in cone-beam CT
Mareike Thies, Fabian Wagner, Noah Maul, Haijun Yu, Manuela Goldmann, Linda-Sophie Schneider, Mingxuan Gu, Siyuan Mei, Lukas Folle, Alexander Preuhs, Michael Manhart, Andreas Maier
Cone-beam computed tomography (CBCT) systems, with their flexibility, present
a promising avenue for direct point-of-care medical imaging, particularly in
critical scenarios such as acute stroke assessment. However, the integration of
CBCT into clinical workflows faces challenges, primarily linked to long scan
duration resulting in patient motion during scanning and leading to image
quality degradation in the reconstructed volumes. This paper introduces a novel
approach to CBCT motion estimation using a gradient-based optimization
algorithm, which leverages generalized derivatives of the backprojection
operator for cone-beam CT geometries. Building on that, a fully differentiable
target function is formulated which grades the quality of the current motion
estimate in reconstruction space. We drastically accelerate motion estimation
yielding a 19-fold speed-up compared to existing methods. Additionally, we
investigate the architecture of networks used for quality metric regression and
propose predicting voxel-wise quality maps, favoring autoencoder-like
architectures over contracting ones. This modification improves gradient flow,
leading to more accurate motion estimation. The presented method is evaluated
through realistic experiments on head anatomy. It achieves a reduction in
reprojection error from an initial average of 3mm to 0.61mm after motion
compensation and consistently demonstrates superior performance compared to
existing approaches. The analytic Jacobian for the backprojection operation,
which is at the core of the proposed method, is made publicly available. In
summary, this paper contributes to the advancement of CBCT integration into
clinical workflows by proposing a robust motion estimation approach that
enhances efficiency and accuracy, addressing critical challenges in
time-sensitive scenarios.
comment: \copyright 2024 IEEE. Personal use of this material is permitted.
Permission from IEEE must be obtained for all other uses, in any current or
future media, including reprinting/republishing this material for advertising
or promotional purposes, creating new collective works, for resale or
redistribution to servers or lists, or reuse of any copyrighted component of
this work in other works
♻ ☆ VeLoRA: Memory Efficient Training using Rank-1 Sub-Token Projections NeurIPS 2024
Large language models (LLMs) have recently emerged as powerful tools for
tackling many language-processing tasks. Despite their success, training and
fine-tuning these models is still far too computationally and memory intensive.
In this paper, we identify and characterise the important components needed for
effective model convergence using gradient descent. In doing so we find that
the intermediate activations used to implement backpropagation can be
excessively compressed without incurring any degradation in performance. This
result leads us to a cheap and memory-efficient algorithm for both fine-tuning
and pre-training LLMs. The proposed algorithm simply divides the tokens up into
smaller sub-tokens before projecting them onto a fixed 1-dimensional subspace
during the forward pass. These features are then coarsely reconstructed during
the backward pass to implement the update rules. We confirm the effectiveness
of our algorithm as being complementary to many state-of-the-art PEFT methods
on the VTAB-1k fine-tuning benchmark. Furthermore, we outperform QLoRA for
fine-tuning LLaMA and show competitive performance against other
memory-efficient pre-training methods on the large-scale C4 dataset.
comment: NeurIPS 2024. Code available at https://github.com/roymiles/VeLoRA
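A toy rendition of the core memory trick is sketched below: a custom autograd function stores only 1-D projections of input sub-tokens during the forward pass and reconstructs a coarse activation for the weight gradient in the backward pass. This is a schematic of the idea, not the released implementation.

```python
# Toy rendition of rank-1 sub-token compression for the saved activations of a
# linear layer: forward stores only 1-D projections of input sub-tokens, backward
# reconstructs a coarse input to form the weight gradient (not the released code).
import torch

class CompressedLinear(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, weight, v, sub_tokens):
        # x: (N, D); split each token into `sub_tokens` chunks of size D // sub_tokens,
        # keep only their scalar projections onto the fixed unit direction v.
        n, d = x.shape
        chunks = x.view(n, sub_tokens, d // sub_tokens)
        coeffs = (chunks * v).sum(-1)                 # (N, sub_tokens) -- all we store
        ctx.save_for_backward(coeffs, weight, v)
        ctx.dims = (n, sub_tokens, d // sub_tokens)
        return x @ weight.t()

    @staticmethod
    def backward(ctx, grad_out):
        coeffs, weight, v = ctx.saved_tensors
        n, s, c = ctx.dims
        x_hat = (coeffs.unsqueeze(-1) * v).view(n, s * c)   # coarse reconstruction
        grad_x = grad_out @ weight                          # exact, needs no saved activations
        grad_w = grad_out.t() @ x_hat                       # approximate weight gradient
        return grad_x, grad_w, None, None

d, sub_tokens = 64, 4
v = torch.randn(d // sub_tokens); v = v / v.norm()          # fixed 1-D subspace
x = torch.randn(32, d, requires_grad=True)
w = torch.randn(16, d, requires_grad=True)
y = CompressedLinear.apply(x, w, v, sub_tokens)
y.sum().backward()
print(x.grad.shape, w.grad.shape)
```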
♻ ☆ Towards Realistic Data Generation for Real-World Super-Resolution
Existing image super-resolution (SR) techniques often fail to generalize
effectively in complex real-world settings due to the significant divergence
between training data and practical scenarios. To address this challenge,
previous efforts have either manually simulated intricate physical-based
degradations or utilized learning-based techniques, yet these approaches remain
inadequate for producing large-scale, realistic, and diverse data
simultaneously. In this paper, we introduce a novel Realistic Decoupled Data
Generator (RealDGen), an unsupervised learning data generation framework
designed for real-world super-resolution. We meticulously develop content and
degradation extraction strategies, which are integrated into a novel
content-degradation decoupled diffusion model to create realistic
low-resolution images from unpaired real LR and HR images. Extensive
experiments demonstrate that RealDGen excels in generating large-scale,
high-quality paired data that mirrors real-world degradations, significantly
advancing the performance of popular SR models on various real-world
benchmarks.
♻ ★ CARLA Drone: Monocular 3D Object Detection from a Different Perspective
Existing techniques for monocular 3D detection have a serious restriction.
They tend to perform well only on a limited set of benchmarks, faring well
either on ego-centric car views or on traffic camera views, but rarely on both.
To encourage progress, this work advocates for an extended evaluation of 3D
detection frameworks across different camera perspectives. We make two key
contributions. First, we introduce the CARLA Drone dataset, CDrone. Simulating
drone views, it substantially expands the diversity of camera perspectives in
existing benchmarks. Despite its synthetic nature, CDrone represents a
real-world challenge. To show this, we confirm that previous techniques
struggle to perform well both on CDrone and a real-world 3D drone dataset.
Second, we develop an effective data augmentation pipeline called GroundMix.
Its distinguishing element is the use of the ground for creating 3D-consistent
augmentation of a training image. GroundMix significantly boosts the detection
accuracy of a lightweight one-stage detector. In our expanded evaluation, we
achieve average precision on par with or substantially higher than the
previous state of the art across all tested datasets.
♻ ☆ UADA3D: Unsupervised Adversarial Domain Adaptation for 3D Object Detection with Sparse LiDAR and Large Domain Gaps
In this study, we address a gap in existing unsupervised domain adaptation
approaches on LiDAR-based 3D object detection, which have predominantly
concentrated on adapting between established, high-density autonomous driving
datasets. We focus on sparser point clouds, capturing scenarios from different
perspectives: not just from vehicles on the road but also from mobile robots on
sidewalks, which encounter significantly different environmental conditions and
sensor configurations. We introduce Unsupervised Adversarial Domain Adaptation
for 3D Object Detection (UADA3D). UADA3D does not depend on pre-trained source
models or teacher-student architectures. Instead, it uses an adversarial
approach to directly learn domain-invariant features. We demonstrate its
efficacy in various adaptation scenarios, showing significant improvements in
both self-driving car and mobile robot domains. Our code is open-source and
will be available soon.
comment: Accepted for IEEE RA-L 2024
♻ ☆ HeightFormer: A Semantic Alignment Monocular 3D Object Detection Method from Roadside Perspective
The on-board 3D object detection technology has received extensive attention
as a critical technology for autonomous driving, while few studies have focused
on applying roadside sensors in 3D traffic object detection. Existing studies
achieve the projection of 2D image features to 3D features through height
estimation based on the frustum. However, they did not consider the height
alignment and the extraction efficiency of bird's-eye-view features. We propose
a novel 3D object detection framework integrating Spatial Former and Voxel
Pooling Former to enhance 2D-to-3D projection based on height estimation.
Extensive experiments were conducted on the Rope3D and DAIR-V2X-I datasets, and
the results demonstrate that the proposed algorithm outperforms existing
methods in detecting both vehicles and cyclists. These results indicate that
the algorithm is robust and generalizes well across various detection
scenarios. Improving the accuracy of roadside 3D object detection helps build a
safe and trustworthy vehicle-road cooperative intelligent transportation system
and promotes the large-scale application of autonomous driving. The code and
pre-trained models will be released on
https://anonymous.4open.science/r/HeightFormer.
♻ ☆ DARES: Depth Anything in Robotic Endoscopic Surgery with Self-supervised Vector-LoRA of the Foundation Model
Mona Sheikh Zeinoddin, Chiara Lena, Jiongqi Qu, Luca Carlini, Mattia Magro, Seunghoi Kim, Elena De Momi, Sophia Bano, Matthew Grech-Sollars, Evangelos Mazomenos, Daniel C. Alexander, Danail Stoyanov, Matthew J. Clarkson, Mobarakol Islam
Robotic-assisted surgery (RAS) relies on accurate depth estimation for 3D
reconstruction and visualization. While foundation models like Depth Anything
Models (DAM) show promise, directly applying them to surgery often yields
suboptimal results. Fully fine-tuning on limited surgical data can cause
overfitting and catastrophic forgetting, compromising model robustness and
generalization. Although Low-Rank Adaptation (LoRA) addresses some adaptation
issues, its uniform parameter distribution neglects the inherent feature
hierarchy, where earlier layers, learning more general features, require more
parameters than later ones. To tackle this issue, we introduce Depth Anything
in Robotic Endoscopic Surgery (DARES), a novel approach that employs a new
adaptation technique, Vector Low-Rank Adaptation (Vector-LoRA) on the DAM V2 to
perform self-supervised monocular depth estimation in RAS scenes. To enhance
learning efficiency, we introduce Vector-LoRA by integrating more parameters in
earlier layers and gradually decreasing parameters in later layers. We also
design a reprojection loss based on the multi-scale SSIM error to enhance depth
perception by better tailoring the foundation model to the specific
requirements of the surgical environment. The proposed method is validated on
the SCARED dataset and demonstrates superior performance over recent
state-of-the-art self-supervised monocular depth estimation techniques,
achieving an improvement of 13.3% in the absolute relative error metric. The
code and pre-trained weights are available at
https://github.com/mobarakol/DARES.
comment: 11 pages
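The layer-wise rank allocation described above can be pictured with the hedged sketch below: earlier layers get larger LoRA ranks than later ones. The linear schedule, rank values, and class names are illustrative assumptions, not the paper's exact configuration.

```python
import torch.nn as nn

def vector_lora_ranks(num_layers: int, r_first: int = 16, r_last: int = 4):
    # Larger ranks for earlier (more general) layers, smaller for later ones;
    # the linear schedule is only an assumption for illustration.
    return [round(r_first + (r_last - r_first) * i / (num_layers - 1))
            for i in range(num_layers)]

class LoRALinear(nn.Module):
    """Standard low-rank adapter wrapped around a frozen linear layer."""
    def __init__(self, base: nn.Linear, rank: int, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False            # keep the foundation weights frozen
        self.down = nn.Linear(base.in_features, rank, bias=False)
        self.up = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.up.weight)         # adapter starts as a no-op
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * self.up(self.down(x))

# Example: wrap one projection per block of a 12-block encoder with decreasing ranks.
layers = [LoRALinear(nn.Linear(384, 384), r) for r in vector_lora_ranks(12)]
```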
♻ ★ Any2Point: Empowering Any-modality Large Models for Efficient 3D Understanding
Yiwen Tang, Ray Zhang, Jiaming Liu, Zoey Guo, Dong Wang, Zhigang Wang, Bin Zhao, Shanghang Zhang, Peng Gao, Hongsheng Li, Xuelong Li
Large foundation models have recently emerged as a prominent focus of
interest, attaining superior performance in widespread scenarios. Due to the
scarcity of 3D data, many efforts have been made to adapt pre-trained
transformers from vision to 3D domains. However, such 2D-to-3D approaches are
still limited, due to the potential loss of spatial geometries and high
computation cost. More importantly, their frameworks are mainly designed for 2D
models, lacking a general any-to-3D paradigm. In this paper, we introduce
Any2Point, a parameter-efficient method to empower any-modality large models
(vision, language, audio) for 3D understanding. Given a frozen transformer from
any source modality, we propose a 3D-to-any (1D or 2D) virtual projection
strategy that correlates the input 3D points to the original 1D or 2D positions
within the source modality. This mechanism enables us to assign each 3D token
with a positional encoding paired with the pre-trained model, which avoids 3D
geometry loss caused by the true projection and better motivates the
transformer for 3D learning with 1D/2D positional priors. Then, within each
transformer block, we insert an any-to-3D guided adapter module for
parameter-efficient fine-tuning. The adapter incorporates prior spatial
knowledge from the source modality to guide the local feature aggregation of 3D
tokens, compelling the semantic adaptation of any-modality transformers. We
conduct extensive experiments to showcase the effectiveness and efficiency of
our method. Code and models are released at
https://github.com/Ivan-Tang-3D/Any2Point.
comment: Code and models are released at
https://github.com/Ivan-Tang-3D/Any2Point
♻ ☆ Point-PEFT: Parameter-Efficient Fine-Tuning for 3D Pre-trained Models
The popularity of pre-trained large models has revolutionized downstream
tasks across diverse fields, such as language, vision, and multi-modality. To
minimize the adaptation cost for downstream tasks, many Parameter-Efficient
Fine-Tuning (PEFT) techniques are proposed for language and 2D image
pre-trained models. However, the specialized PEFT method for 3D pre-trained
models is still under-explored. To this end, we introduce Point-PEFT, a novel
framework for adapting point cloud pre-trained models with minimal learnable
parameters. Specifically, for a pre-trained 3D model, we freeze most of its
parameters, and only tune the newly added PEFT modules on downstream tasks,
which consist of a Point-prior Prompt and a Geometry-aware Adapter. The
Point-prior Prompt adopts a set of learnable prompt tokens, for which we
propose to construct a memory bank with domain-specific knowledge, and utilize
a parameter-free attention mechanism to enhance the prompt tokens. The Geometry-aware
Adapter aims to aggregate point cloud features within spatial neighborhoods to
capture fine-grained geometric information through local interactions.
Extensive experiments indicate that our Point-PEFT can achieve better
performance than full fine-tuning on various downstream tasks, while using
only 5% of the trainable parameters, demonstrating the efficiency and
effectiveness of our approach. Code is released at
https://github.com/Ivan-Tang-3D/Point-PEFT.
comment: The specialized PEFT framework for 3D pre-trained models, which
achieves competitive performance to full fine-tuning, and significantly
reduces the computational resources. Project page:
https://github.com/Ivan-Tang-3D/Point-PEFT
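One plausible reading of the parameter-free attention over a memory bank mentioned above is sketched below; the cosine-similarity formulation, temperature, and tensor shapes are assumptions for illustration only, not the paper's exact design.

```python
import torch
import torch.nn.functional as F

def enhance_prompts(prompts: torch.Tensor, memory_bank: torch.Tensor,
                    temperature: float = 0.07) -> torch.Tensor:
    """Sketch: each prompt token attends to a frozen bank of domain-specific
    features using cosine similarity alone, with no learnable projections."""
    q = F.normalize(prompts, dim=-1)       # (num_prompts, dim)
    k = F.normalize(memory_bank, dim=-1)   # (bank_size, dim)
    attn = torch.softmax(q @ k.t() / temperature, dim=-1)
    return prompts + attn @ memory_bank    # residual enhancement of the prompts

enhanced = enhance_prompts(torch.randn(8, 384), torch.randn(1024, 384))
```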
♻ ☆ Fool Me Once? Contrasting Textual and Visual Explanations in a Clinical Decision-Support Setting EMNLP 2024
Maxime Kayser, Bayar Menzat, Cornelius Emde, Bogdan Bercean, Alex Novak, Abdala Espinosa, Bartlomiej W. Papiez, Susanne Gaube, Thomas Lukasiewicz, Oana-Maria Camburu
The growing capabilities of AI models are leading to their wider use,
including in safety-critical domains. Explainable AI (XAI) aims to make these
models safer to use by making their inference process more transparent.
However, current explainability methods are seldom evaluated in the way they
are intended to be used: by real-world end users. To address this, we conducted
a large-scale user study with 85 healthcare practitioners in the context of
human-AI collaborative chest X-ray analysis. We evaluated three types of
explanations: visual explanations (saliency maps), natural language
explanations, and a combination of both modalities. We specifically examined
how different explanation types influence users depending on whether the AI
advice and explanations are factually correct. We find that text-based
explanations lead to significant over-reliance, which is alleviated by
combining them with saliency maps. We also observe that the quality of
explanations, that is, how much factually correct information they entail, and
how much this aligns with AI correctness, significantly impacts the usefulness
of the different explanation types.
comment: EMNLP 2024
♻ ☆ Diffusion Lens: Interpreting Text Encoders in Text-to-Image Pipelines ACL 2024
Text-to-image diffusion models (T2I) use a latent representation of a text
prompt to guide the image generation process. However, the process by which the
encoder produces the text representation is unknown. We propose the Diffusion
Lens, a method for analyzing the text encoder of T2I models by generating
images from its intermediate representations. Using the Diffusion Lens, we
perform an extensive analysis of two recent T2I models. Exploring compound
prompts, we find that complex scenes describing multiple objects are composed
progressively and more slowly compared to simple scenes; exploring knowledge
retrieval, we find that representation of uncommon concepts requires further
computation compared to common concepts, and that knowledge retrieval is
gradual across layers. Overall, our findings provide valuable insights into the
text encoder component in T2I pipelines.
comment: Published in: ACL 2024 Project webpage:
tokeron.github.io/DiffusionLensWeb
♻ ☆ DriveDreamer4D: World Models Are Effective Data Machines for 4D Driving Scene Representation
Guosheng Zhao, Chaojun Ni, Xiaofeng Wang, Zheng Zhu, Xueyang Zhang, Yida Wang, Guan Huang, Xinze Chen, Boyuan Wang, Youyi Zhang, Wenjun Mei, Xingang Wang
Closed-loop simulation is essential for advancing end-to-end autonomous
driving systems. Contemporary sensor simulation methods, such as NeRF and 3DGS,
rely predominantly on conditions closely aligned with training data
distributions, which are largely confined to forward-driving scenarios.
Consequently, these methods face limitations when rendering complex maneuvers
(e.g., lane change, acceleration, deceleration). Recent advancements in
autonomous-driving world models have demonstrated the potential to generate
diverse driving videos. However, these approaches remain constrained to 2D
video generation, inherently lacking the spatiotemporal coherence required to
capture intricacies of dynamic driving environments. In this paper, we
introduce DriveDreamer4D, which enhances 4D driving scene representation
leveraging world model priors. Specifically, we utilize the world model as a
data machine to synthesize novel trajectory videos based on real-world driving
data. Notably, we explicitly leverage structured conditions to control the
spatial-temporal consistency of foreground and background elements, so that the
generated data adheres closely to traffic constraints. To our knowledge,
DriveDreamer4D is the first to utilize video generation models for improving 4D
reconstruction in driving scenarios. Experimental results reveal that
DriveDreamer4D significantly enhances generation quality under novel trajectory
views, achieving a relative improvement in FID by 24.5%, 39.0%, and 10.5%
compared to PVG, S3Gaussian, and Deformable-GS. Moreover, DriveDreamer4D
markedly enhances the spatiotemporal coherence of driving agents, which is
verified by a comprehensive user study and the relative increases of 20.3%,
42.0%, and 13.7% in the NTA-IoU metric.
comment: Project Page: https://drivedreamer4d.github.io
♻ ☆ Deep Multimodal Learning with Missing Modality: A Survey
During multimodal model training and testing, certain data modalities may be
absent due to sensor limitations, cost constraints, privacy concerns, or data
loss, negatively affecting performance. Multimodal learning techniques designed
to handle missing modalities can mitigate this by ensuring model robustness
even when some modalities are unavailable. This survey reviews recent progress
in Multimodal Learning with Missing Modality (MLMM), focusing on deep learning
methods. It provides the first comprehensive survey that covers the motivation
and distinctions between MLMM and standard multimodal learning setups, followed
by a detailed analysis of current methods, applications, and datasets,
concluding with challenges and future directions.
comment: Submitted to ACM Computing Surveys
♻ ☆ Shotluck Holmes: A Family of Efficient Small-Scale Large Language Vision Models For Video Captioning and Summarization
Video is an increasingly prominent and information-dense medium, yet it poses
substantial challenges for language models. A typical video consists of a
sequence of shorter segments, or shots, that collectively form a coherent
narrative. Each shot is analogous to a word in a sentence where multiple data
streams of information (such as visual and auditory data) must be processed
simultaneously. Comprehension of the entire video requires not only
understanding the visual-audio information of each shot but also requires that
the model links the ideas between each shot to generate a larger,
all-encompassing story. Despite significant progress in the field, current
works often overlook videos' more granular shot-by-shot semantic information.
In this project, we propose a family of efficient large language vision models
(LLVMs) to boost video summarization and captioning called Shotluck Holmes. By
leveraging better pretraining and data collection strategies, we extend the
abilities of existing small LLVMs from being able to understand a picture to
being able to understand a sequence of frames. Specifically, we show that
Shotluck Holmes achieves better performance than state-of-the-art results on
the Shot2Story video captioning and summary task with significantly smaller and
more computationally efficient models.
♻ ☆ LongVILA: Scaling Long-Context Visual Language Models for Long Videos
Fuzhao Xue, Yukang Chen, Dacheng Li, Qinghao Hu, Ligeng Zhu, Xiuyu Li, Yunhao Fang, Haotian Tang, Shang Yang, Zhijian Liu, Ethan He, Hongxu Yin, Pavlo Molchanov, Jan Kautz, Linxi Fan, Yuke Zhu, Yao Lu, Song Han
Long-context capability is critical for multi-modal foundation models,
especially for long video understanding. We introduce LongVILA, a full-stack
solution for long-context visual-language models by co-designing the algorithm
and system. For model training, we upgrade existing VLMs to support long video
understanding by incorporating two additional stages, i.e.,
long context extension and long video supervised fine-tuning. However, training
on long video is computationally and memory intensive. We introduce the
long-context Multi-Modal Sequence Parallelism (MM-SP) system that efficiently
parallelizes long video training and inference, enabling 2M context length
training on 256 GPUs without any gradient checkpointing. LongVILA efficiently
extends the number of video frames of VILA from 8 to 2048, improving the long
video captioning score from 2.00 to 3.26 (out of 5), achieving 99.8% accuracy
in 6,000-frame (more than 1 million tokens) video needle-in-a-haystack.
LongVILA-7B demonstrates strong accuracy on the VideoMME benchmark, i.e., 61.8%
with subtitle. Besides, MM-SP is 2.1x - 5.7x faster than ring style sequence
parallelism and 1.1x - 1.4x faster than Megatron with a hybrid context and
tensor parallelism. Moreover, it seamlessly integrates with Hugging Face
Transformers.
comment: Code and models are available at
https://github.com/NVlabs/VILA/blob/main/LongVILA.md
♻ ☆ Back-in-Time Diffusion: Unsupervised Detection of Medical Deepfakes
Recent progress in generative models has made it easier for a wide audience
to edit and create image content, raising concerns about the proliferation of
deepfakes, especially in healthcare. Despite the availability of numerous
techniques for detecting manipulated images captured by conventional cameras,
their applicability to medical images is limited. This limitation stems from
the distinctive forensic characteristics of medical images, a result of their
imaging process.
In this work we propose a novel anomaly detector for medical imagery based on
diffusion models. Normally, diffusion models are used to generate images.
However, we show how a similar process can be used to detect synthetic content
by making a model reverse the diffusion on a suspected image. We evaluate our
method on the task of detecting fake tumors injected and removed from CT and
MRI scans. Our method significantly outperforms other state-of-the-art
unsupervised detectors, improving the average AUC from 0.79 to 0.90 for
injection and from 0.91 to 0.96 for removal. We also explore our hypothesis using
AI explainability tools and publish our code and new medical deepfake datasets
to encourage further research into this domain.
♻ ☆ Motion Segmentation for Neuromorphic Aerial Surveillance
Aerial surveillance demands rapid and precise detection of moving objects in
dynamic environments. Event cameras, which draw inspiration from biological
vision systems, present a promising alternative to frame-based sensors due to
their exceptional temporal resolution, superior dynamic range, and minimal
power requirements. Unlike traditional frame-based sensors that capture
redundant information at fixed intervals, event cameras asynchronously record
pixel-level brightness changes, providing a continuous and efficient data
stream well suited to fast motion segmentation. However, existing event-based
motion segmentation methods often suffer from limitations such as the need for
per-scene parameter tuning or reliance on manual labelling, hindering their
scalability and practical deployment. In this paper, we address these
challenges by introducing a novel motion segmentation method that leverages
self-supervised vision transformers on both event data and optical flow
information. Our approach eliminates the need for human annotations and reduces
dependency on scene-specific parameters. For evaluation, we used the EVK4-HD
Prophesee event camera onboard a highly
dynamic aerial platform in urban settings. We conduct extensive evaluations of
our framework across multiple datasets, demonstrating state-of-the-art
performance compared to existing benchmarks. Our method can effectively handle
various types of motion and an arbitrary number of moving objects. Code and
dataset are available at: https://samiarja.github.io/evairborne/
comment: 17 pages, 11 figures, 8 tables
♻ ☆ You Only Sample Once: Taming One-Step Text-to-Image Synthesis by Self-Cooperative Diffusion GANs
Recently, some works have tried to combine diffusion and Generative
Adversarial Networks (GANs) to alleviate the computational cost of the
iterative denoising inference in Diffusion Models (DMs). However, existing
works in this line suffer from either training instability and mode collapse or
subpar one-step generation learning efficiency. To address these issues, we
introduce YOSO, a novel generative model designed for rapid, scalable, and
high-fidelity one-step image synthesis with high training stability and mode
coverage. Specifically, we smooth the adversarial divergence by the denoising
generator itself, performing self-cooperative learning. We show that our method
can serve as a one-step generation model trained from scratch with competitive
performance. Moreover, we extend our YOSO to one-step text-to-image generation
based on pre-trained models by several effective training techniques (i.e.,
latent perceptual loss and latent discriminator for efficient training along
with the latent DMs; the informative prior initialization (IPI), and the quick
adaption stage for fixing the flawed noise scheduler). Experimental results
show that YOSO achieves the state-of-the-art one-step generation performance
even with Low-Rank Adaptation (LoRA) fine-tuning. In particular, we show that
the YOSO-PixArt-$\alpha$ can generate images in one step when trained at 512
resolution, with the capability of adapting to 1024 resolution without extra
explicit training, requiring only ~10 A800 days for fine-tuning. Our code is
provided at https://github.com/Luo-Yihong/YOSO.
comment: Revision
♻ ☆ Enhanced Prompt-leveraged Weakly Supervised Cancer Segmentation based on Segment Anything
This work proposes a novel approach beyond supervised learning for effective
pathological image analysis, addressing the challenge of limited robust labeled
data. Pathological diagnosis of diseases like cancer has conventionally relied
on the evaluation of morphological features by physicians and pathologists.
However, recent advancements in computer-aided diagnosis (CAD) systems are
gaining significant attention as diagnostic support tools. Although the
advancement of deep learning has improved CAD significantly, segmentation
models typically require large pixel-level annotated datasets, and such labeling
is expensive. Existing studies not based on supervised approaches still
struggle with limited generalization, and no practical approach has emerged
yet. To address this issue, we present a weakly supervised semantic
segmentation (WSSS) model by combining class activation map and Segment
Anything Model (SAM)-based pseudo-labeling. For effective pretraining, we adopt
SAM, a foundation model that is pretrained on large datasets and operates in
zero-shot configurations using only coarse prompts. The proposed approach
transfers the enhanced Attention Dropout Layer's knowledge to SAM, thereby
generating pseudo-labels. To demonstrate the superiority of the proposed
method, experimental studies are conducted on histopathological breast cancer
datasets. The proposed method outperformed other WSSS methods across three
datasets, demonstrating its efficiency by achieving this with only 12GB of GPU
memory during training. Our code is available at :
https://github.com/QI-NemoSong/EPLC-SAM
comment: 10 pages, 7 figures
♻ ☆ Look, Listen, and Answer: Overcoming Biases for Audio-Visual Question Answering NeurIPS 2024
Audio-Visual Question Answering (AVQA) is a complex multi-modal reasoning
task, demanding intelligent systems to accurately respond to natural language
queries based on audio-video input pairs. Nevertheless, prevalent AVQA
approaches are prone to overlearning dataset biases, resulting in poor
robustness. Furthermore, current datasets may not provide a precise diagnostic
for these methods. To tackle these challenges, firstly, we propose a novel
dataset, MUSIC-AVQA-R, crafted in two steps: rephrasing questions within the
test split of a public dataset (MUSIC-AVQA) and subsequently introducing
distribution shifts to split questions. The former leads to a large, diverse
test space, while the latter results in a comprehensive robustness evaluation
on rare, frequent, and overall questions. Secondly, we propose a robust
architecture that utilizes a multifaceted cycle collaborative debiasing
strategy to overcome bias learning. Experimental results show that this
architecture achieves state-of-the-art performance on MUSIC-AVQA-R, notably
obtaining a significant improvement of 9.32%. Extensive ablation experiments
are conducted on the two datasets mentioned to analyze the component
effectiveness within the debiasing strategy. Additionally, we highlight the
limited robustness of existing multi-modal QA methods through the evaluation on
our dataset. We also conduct experiments combining various baselines with our
proposed strategy on two datasets to verify its plug-and-play capability. Our
dataset and code are available at https://github.com/reml-group/MUSIC-AVQA-R.
comment: Accepted by NeurIPS 2024
♻ ☆ NutrifyAI: An AI-Powered System for Real-Time Food Detection, Nutritional Analysis, and Personalized Meal Recommendations
With diet and nutrition apps reaching 1.4 billion users in 2022 [1], it is no
surprise that popular health apps such as MyFitnessPal, Noom, and Calorie
Counter are surging in popularity. However, one major setback [2] of nearly all
nutrition
applications is that users must enter food data manually, which is
time-consuming and tedious. Thus, there has been an increasing demand for
applications that can accurately identify food items, analyze their nutritional
content, and offer dietary recommendations in real-time. This paper introduces
a comprehensive system that combines advanced computer vision techniques with
nutritional analysis, implemented in a versatile mobile and web application.
The system is divided into three key concepts: 1) food detection using the
YOLOv8 model, 2) nutrient analysis via the Edamam Nutrition Analysis API, and
3) personalized meal recommendations using the Edamam Meal Planning and Recipe
Search APIs. Preliminary results showcase the system's effectiveness by
providing immediate, accurate dietary insights, with a demonstrated food
recognition accuracy of nearly 80%, making it a valuable tool for users to make
informed dietary decisions.
comment: 4 pages, 8 figures
♻ ☆ HiRT: Enhancing Robotic Control with Hierarchical Robot Transformers
Large Vision-Language-Action (VLA) models, leveraging powerful pre-trained
Vision-Language Model (VLM) backends, have shown promise in robotic control
due to their impressive generalization ability. However, the success comes at a
cost. Their reliance on VLM backends with billions of parameters leads to high
computational costs and inference latency, limiting the testing scenarios to
mainly quasi-static tasks and hindering performance in dynamic tasks requiring
rapid interactions. To address these limitations, this paper proposes HiRT, a
Hierarchical Robot Transformer framework that enables flexible frequency and
performance trade-off. HiRT keeps VLMs running at low frequencies to capture
temporally invariant features while enabling real-time interaction through a
high-frequency vision-based policy guided by the slowly updated features.
Experiment results in both simulation and real-world settings demonstrate
significant improvements over baseline methods. Empirically, in static tasks,
we double the control frequency and achieve comparable success rates.
Additionally, on novel real-world dynamic manipulation tasks, which are
challenging for previous VLA models, HiRT improves the success rate from 48% to
75%.
♻ ☆ PointSeg: A Training-Free Paradigm for 3D Scene Segmentation via Foundation Models
Qingdong He, Jinlong Peng, Zhengkai Jiang, Xiaobin Hu, Jiangning Zhang, Qiang Nie, Yabiao Wang, Chengjie Wang
The recent success of vision foundation models has shown promising performance
for 2D perception tasks. However, it is difficult to train a 3D foundation
network directly due to limited data, and it remains underexplored whether
existing foundation models can be lifted to 3D space seamlessly. In
this paper, we present PointSeg, a novel training-free paradigm that leverages
off-the-shelf vision foundation models to address 3D scene perception tasks.
PointSeg can segment anything in a 3D scene by acquiring accurate 3D prompts to
align their corresponding pixels across frames. Concretely, we design a
two-branch prompt learning structure to construct 3D point-box prompt pairs,
combined with a bidirectional matching strategy for accurate point
and proposal prompt generation. Then, we perform iterative post-refinement
adaptively when cooperating with different vision foundation models. Moreover,
we design an affinity-aware merging algorithm to improve the final ensemble
masks. PointSeg demonstrates impressive segmentation performance across various
datasets, all without training. Specifically, our approach significantly
surpasses the state-of-the-art specialist training-free model by 14.1%,
12.3%, and 12.6% mAP on ScanNet, ScanNet++, and KITTI-360 datasets,
respectively. On top of that, PointSeg can incorporate with various foundation
models and even surpasses the specialist training-based methods by
3.4%-5.4% mAP across various datasets, serving as an effective generalist
model.
♻ ☆ LiteVLoc: Map-Lite Visual Localization for Image Goal Navigation
Jianhao Jiao, Jinhao He, Changkun Liu, Sebastian Aegidius, Xiangcheng Hu, Tristan Braud, Dimitrios Kanoulas
This paper presents LiteVLoc, a hierarchical visual localization framework
that uses a lightweight topo-metric map to represent the environment. The
method consists of three sequential modules that estimate camera poses in a
coarse-to-fine manner. Unlike mainstream approaches relying on detailed 3D
representations, LiteVLoc reduces storage overhead by leveraging learning-based
feature matching and geometric solvers for metric pose estimation. A novel
dataset for the map-free relocalization task is also introduced. Extensive
experiments including localization and navigation in both simulated and
real-world scenarios have validated the system's performance and demonstrated
its precision and efficiency for large-scale deployment. Code and data will be
made publicly available.
comment: 9 pages, 4 figures
♻ ☆ Cardiac Copilot: Automatic Probe Guidance for Echocardiography with World Model MICCAI2024
Echocardiography is the only technique capable of real-time imaging of the
heart and is vital for diagnosing the majority of cardiac diseases. However,
there is a severe shortage of experienced cardiac sonographers, due to the
heart's complex structure and significant operational challenges. To mitigate
this situation, we present a Cardiac Copilot system capable of providing
real-time probe movement guidance to assist less experienced sonographers in
conducting freehand echocardiography. This system can enable non-experts,
especially in primary departments and medically underserved areas, to perform
cardiac ultrasound examinations, potentially improving global healthcare
delivery. The core innovation lies in proposing a data-driven world model,
named Cardiac Dreamer, for representing cardiac spatial structures. This world
model can provide structure features of any cardiac plane around the current
probe position in the latent space, serving as a precise navigation map for
autonomous plane localization. We train our model with real-world ultrasound
data and corresponding probe motion from 110 routine clinical scans with 151K
sample pairs by three certified sonographers. Evaluations on three standard
planes with 37K sample pairs demonstrate that the world model can reduce
navigation errors by up to 33% and exhibit more stable performance.
comment: Accepted by MICCAI2024
♻ ☆ A Rainbow in Deep Network Black Boxes
A central question in deep learning is to understand the functions learned by
deep networks. What is their approximation class? Do the learned weights and
representations depend on initialization? Previous empirical work has evidenced
that kernels defined by network activations are similar across initializations.
For shallow networks, this has been theoretically studied with random feature
models, but an extension to deep networks has remained elusive. Here, we
provide a deep extension of such random feature models, which we call the
rainbow model. We prove that rainbow networks define deterministic
(hierarchical) kernels in the infinite-width limit. The resulting functions
thus belong to a data-dependent RKHS which does not depend on the weight
randomness. We also verify numerically our modeling assumptions on deep CNNs
trained on image classification tasks, and show that the trained networks
approximately satisfy the rainbow hypothesis. In particular, rainbow networks
sampled from the corresponding random feature model achieve similar performance
as the trained networks. Our results highlight the central role played by the
covariances of network weights at each layer, which are observed to be low-rank
as a result of feature learning.
comment: 59 pages, 10 figures. To appear at JMLR
♻ ☆ FSL-Rectifier: Rectify Outliers in Few-Shot Learning via Test-Time Augmentation
Few-shot learning (FSL) commonly requires a model to identify images
(queries) that belong to classes unseen during training, based on a few labeled
samples of the new classes (support set) as reference. So far, plenty of
algorithms involve training data augmentation to improve the generalization
capability of FSL models, but outlier queries or support images during
inference can still pose great generalization challenges. In this work, to
reduce the bias caused by the outlier samples, we generate additional
test-class samples by combining original samples with suitable train-class
samples via a generative image combiner. Then, we obtain averaged features via
an augmentor, which leads to more typical representations through the
averaging. We experimentally and theoretically demonstrate the effectiveness of
our method, e.g., obtaining a test accuracy improvement proportion of around
10% (e.g., from 46.86% to 53.28%) for trained FSL models. Importantly, given a
pretrained image combiner, our method is training-free for off-the-shelf FSL
models, whose performance can be improved without extra datasets or further
training of the models themselves.
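A minimal sketch of the test-time averaging step described above is given below; `encoder` and `combine` stand in for the trained FSL backbone and the generative image combiner, and the number of combinations is an arbitrary assumption.

```python
import torch

@torch.no_grad()
def rectified_feature(encoder, query, train_images, combine, num_aug: int = 4):
    """Sketch: combine a (possibly outlier) query with suitable train-class
    images, encode all variants, and average the features to obtain a more
    typical representation for few-shot classification."""
    variants = [query] + [combine(query, train_images[i % len(train_images)])
                          for i in range(num_aug)]
    feats = torch.stack([encoder(v.unsqueeze(0)).squeeze(0) for v in variants])
    return feats.mean(dim=0)
```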
♻ ☆ GMAI-MMBench: A Comprehensive Multimodal Evaluation Benchmark Towards General Medical AI
Pengcheng Chen, Jin Ye, Guoan Wang, Yanjun Li, Zhongying Deng, Wei Li, Tianbin Li, Haodong Duan, Ziyan Huang, Yanzhou Su, Benyou Wang, Shaoting Zhang, Bin Fu, Jianfei Cai, Bohan Zhuang, Eric J Seibel, Junjun He, Yu Qiao
Large Vision-Language Models (LVLMs) are capable of handling diverse data
types such as imaging, text, and physiological signals, and can be applied in
various fields. In the medical field, LVLMs have a high potential to offer
substantial assistance for diagnosis and treatment. Before that, it is crucial
to develop benchmarks to evaluate LVLMs' effectiveness in various medical
applications. Current benchmarks are often built upon specific academic
literature, mainly focusing on a single domain, and lacking varying perceptual
granularities. Thus, they face specific challenges, including limited clinical
relevance, incomplete evaluations, and insufficient guidance for interactive
LVLMs. To address these limitations, we developed the GMAI-MMBench, the most
comprehensive general medical AI benchmark with well-categorized data structure
and multi-perceptual granularity to date. It is constructed from 284 datasets
across 38 medical image modalities, 18 clinical-related tasks, 18 departments,
and 4 perceptual granularities in a Visual Question Answering (VQA) format.
Additionally, we implemented a lexical tree structure that allows users to
customize evaluation tasks, accommodating various assessment needs and
substantially supporting medical AI research and applications. We evaluated 50
LVLMs, and the results show that even the advanced GPT-4o only achieves an
accuracy of 53.96%, indicating significant room for improvement. Moreover, we
identified five key insufficiencies in current cutting-edge LVLMs that need to
be addressed to advance the development of better medical applications. We
believe that GMAI-MMBench will stimulate the community to build the next
generation of LVLMs toward GMAI.
comment: GitHub: https://github.com/uni-medical/GMAI-MMBench Hugging face:
https://huggingface.co/datasets/OpenGVLab/GMAI-MMBench
♻ ☆ Open-World Continual Learning: Unifying Novelty Detection and Continual Learning
As AI agents are increasingly used in the real open world with unknowns or
novelties, they need the ability to (1) (a) recognize objects that they have
learned before and (b) detect items that they have never seen or learned, and
(2) learn the new items incrementally to become more and more knowledgeable and
powerful. (1) is called novelty detection or out-of-distribution (OOD)
detection and (2) is called class incremental learning (CIL), which is a
setting of continual learning (CL). In existing research, OOD detection and CIL
are regarded as two completely different problems. This paper first provides a
theoretical proof that good OOD detection for each task within the set of
learned tasks (called closed-world OOD detection) is necessary for successful
CIL. We show this by decomposing CIL into two sub-problems: within-task
prediction (WP) and task-id prediction (TP), and proving that TP is correlated
with closed-world OOD detection. The key theoretical result is that regardless
of whether WP and OOD detection (or TP) are defined explicitly or implicitly by
a CIL algorithm, good WP and good closed-world OOD detection are necessary and
sufficient conditions for good CIL, which unifies novelty or OOD detection and
continual learning (CIL, in particular). We call this traditional CIL the
closed-world CIL as it does not detect future OOD data in the open world. The
paper then proves that the theory can be generalized or extended to open-world
CIL, which is the proposed open-world continual learning, that can perform CIL
in the open world and detect future or open-world OOD data. Based on the
theoretical results, new CIL methods are also designed, which outperform strong
baselines in CIL accuracy and in continual OOD detection by a large margin.
comment: To appear in Artificial Intelligence Journal. arXiv admin note:
substantial text overlap with arXiv:2211.02633
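For reference, the WP/TP decomposition mentioned above is commonly written as the product of a within-task probability and a task-id probability; the notation below is a standard rendering rather than a quotation from the paper.

```latex
\[
  P(y_{t,j} \mid x)
    = \underbrace{P(y_{t,j} \mid x, t)}_{\text{within-task prediction (WP)}}
      \cdot \underbrace{P(t \mid x)}_{\text{task-id prediction (TP)}}
\]
```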
♻ ☆ PIR: Remote Sensing Image-Text Retrieval with Prior Instruction Representation Learning
Remote sensing image-text retrieval constitutes a foundational aspect of
remote sensing interpretation tasks, facilitating the alignment of vision and
language representations. This paper introduces a prior instruction
representation (PIR) learning paradigm that draws on prior knowledge to
instruct adaptive learning of vision and text representations. Based on PIR, a
domain-adapted remote sensing image-text retrieval framework PIR-ITR is
designed to address semantic noise issues in vision-language understanding
tasks. However, with massive additional data for pre-training the
vision-language foundation model, remote sensing image-text retrieval is
further developed into an open-domain retrieval task. Continuing with the
above, we propose PIR-CLIP, a domain-specific CLIP-based framework for remote
sensing image-text retrieval, to address semantic noise in remote sensing
vision-language representations and further improve open-domain retrieval
performance. In vision representation, we utilize the prior-guided knowledge of
the remote sensing scene recognition by building a belief matrix to select key
features for reducing the impact of semantic noise. In text representation, we
use the previous time step to cyclically activate the current time step to
enhance text representation capability. A cluster-wise Affiliation Loss (AL) is
proposed to constrain inter-class relations and to reduce the semantic confusion
zones in the common subspace. Comprehensive experiments demonstrate that PIR
could enhance vision and text representations and outperform the
state-of-the-art methods of closed-domain and open-domain retrieval on two
benchmark datasets, RSICD and RSITMD.
comment: 13 pages, 8 figures
♻ ☆ MAL: Motion-Aware Loss with Temporal and Distillation Hints for Self-Supervised Depth Estimation ICRA 2024
Depth perception is crucial for a wide range of robotic applications.
Multi-frame self-supervised depth estimation methods have gained research
interest due to their ability to leverage large-scale, unlabeled real-world
data. However, the self-supervised methods often rely on the assumption of a
static scene and their performance tends to degrade in dynamic environments. To
address this issue, we present Motion-Aware Loss (MAL), which leverages the temporal
relation among consecutive input frames and a novel distillation scheme between
the teacher and student networks in the multi-frame self-supervised depth
estimation methods. Specifically, we associate the spatial locations of moving
objects with the temporal order of input frames to eliminate errors induced by
object motion. Meanwhile, we enhance the original distillation scheme in
multi-frame methods to better exploit the knowledge from a teacher network. MAL
is a novel, plug-and-play module designed for seamless integration into
multi-frame self-supervised monocular depth estimation methods. Adding MAL into
previous state-of-the-art methods leads to a reduction in depth estimation
errors by up to 4.2% and 10.8% on KITTI and CityScapes benchmarks,
respectively.
comment: Accepted by ICRA 2024; Project homepage:
https://yuejiangdong.github.io/MotionAwareLoss/
♻ ☆ End-to-End Rate-Distortion Optimized 3D Gaussian Representation ECCV 2024
3D Gaussian Splatting (3DGS) has become an emerging technique with remarkable
potential in 3D representation and image rendering. However, the substantial
storage overhead of 3DGS significantly impedes its practical applications. In
this work, we formulate the compact 3D Gaussian learning as an end-to-end
Rate-Distortion Optimization (RDO) problem and propose RDO-Gaussian that can
achieve flexible and continuous rate control. RDO-Gaussian addresses two main
issues that exist in current schemes: 1) Different from prior endeavors that
minimize the rate under a fixed distortion, we introduce dynamic pruning and
entropy-constrained vector quantization (ECVQ) that optimize the rate and
distortion at the same time. 2) Previous works treat the colors of each
Gaussian equally, while we model the colors of different regions and materials
with learnable numbers of parameters. We verify our method on both real and
synthetic scenes, showcasing that RDO-Gaussian reduces the size of the 3D
Gaussian representation by over 40x and surpasses existing methods in rate-distortion
performance.
comment: ECCV 2024
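Abstractly, the end-to-end objective described above is the classical Lagrangian rate-distortion trade-off; the exact distortion and entropy terms used in the paper may differ from this generic form.

```latex
\[
  \min_{\theta}\; \mathcal{L}(\theta) \;=\; D(\theta) \;+\; \lambda\, R(\theta)
\]
```

Here D is the rendering distortion, R the estimated bit-rate of the pruned and ECVQ-quantized Gaussian parameters, and sweeping the multiplier lambda yields the continuous rate control described above.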
♻ ☆ CinePile: A Long Video Question Answering Dataset and Benchmark
Ruchit Rawal, Khalid Saifullah, Miquel Farré, Ronen Basri, David Jacobs, Gowthami Somepalli, Tom Goldstein
Current datasets for long-form video understanding often fall short of
providing genuine long-form comprehension challenges, as many tasks derived
from these datasets can be successfully tackled by analyzing just one or a few
random frames from a video. To address this issue, we present a novel dataset
and benchmark, CinePile, specifically designed for authentic long-form video
understanding. This paper details our innovative approach for creating a
question-answer dataset, utilizing advanced LLMs with human-in-the-loop and
building upon human-generated raw data. Our comprehensive dataset comprises
305,000 multiple-choice questions (MCQs), covering various visual and
multimodal aspects, including temporal comprehension, understanding
human-object interactions, and reasoning about events or actions within a
scene. Additionally, we fine-tuned open-source Video-LLMs on the training split
and evaluated both open-source and proprietary video-centric LLMs on the test
split of our dataset. The findings indicate that although current models
underperform compared to humans, fine-tuning these models can lead to
significant improvements in their performance.
comment: Project page with all the artifacts -
https://ruchitrawal.github.io/cinepile/. Updated version with adversarial
refinement pipeline and more model evaluations
♻ ★ Show-o: One Single Transformer to Unify Multimodal Understanding and Generation
Jinheng Xie, Weijia Mao, Zechen Bai, David Junhao Zhang, Weihao Wang, Kevin Qinghong Lin, Yuchao Gu, Zhijie Chen, Zhenheng Yang, Mike Zheng Shou
We present a unified transformer, i.e., Show-o, that unifies multimodal
understanding and generation. Unlike fully autoregressive models, Show-o
unifies autoregressive and (discrete) diffusion modeling to adaptively handle
inputs and outputs of various and mixed modalities. The unified model flexibly
supports a wide range of vision-language tasks including visual
question-answering, text-to-image generation, text-guided
inpainting/extrapolation, and mixed-modality generation. Across various
benchmarks, it demonstrates comparable or superior performance to existing
individual models with an equivalent or larger number of parameters tailored
for understanding or generation. This significantly highlights its potential as
a next-generation foundation model. Code and models are released at
https://github.com/showlab/Show-o.
comment: Technical Report